1 Introduction

This guide will cover the code needed for binning contigs into larger assemblies, and hopefully metagenomic assembled genomes (MAGs).

If you have not already, you should have quality controlled your reads to ensure no adapter content and no-low quality sequences. Follow my guide here which is aimed at amplicon metabarcoding but the same quality control applies to shotgun metagenomic reads: https://scottc-bio.github.io/guides/Metabarcoding-quality-control-with-FastQC.html. After this the reads needs to have been assembled into a coassembly, or multiple single assemblies which can be performed following my guide: https://scottc-bio.github.io/guides/Shotgun-metagenomic-coassembly.html. And then the reads for each sample need to be mapped back to the coassembly to allow filtering of contigs with insufficient coverage, which can be performed using the following guide: https://scottc-bio.github.io/guides/Shotgun-metagenomic-read-mapping.html.

Assuming that you have performed all these steps, you are now ready to bin your contigs. Multiple tools are available for this, I will be using MetaBAT2 (Kang et al. 2019).

2 What is binning?

When you assembled your contigs, reads with overlapping areas of sequence were assembled like a jigsaw into longer contigs with no inference to taxonomy. Mapping reads from each sample to the coassembly generated the data for coverage patterns of each contig across samples. When we bin contigs, the patterns of coverage across samples are used to suggest a relationship between contigs i.e. contigs with coverage values that fall and rise in the same samples likely come from the same organism. Binning also uses things like GC content of the contigs, tetra-nucleotide frequency, and maybe some other k-mer parameters depending on your assembly program.

3 Installing MetaBAT2

The simplest way to use MetaBAT2 is to use a conda environment, assuming you have conda on your system.

3.1 Installing conda

Miniconda is required for many bioinformatic processes and useful to have.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Enter ‘yes’ to both questions.

For changes to register, will have to shut down the LINUX system or logout of a VM, and then restart or re-connect.

3.2 Installing MetaBAT2 with miniconda

Set up the necessary tools for your conda. The order of thes lines matters.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create a new environment.

conda create -n metabat2_env metabat2

Activate the environment.

conda activate metabat2_env

Test the installation.

metabat2 --version
jgi_summarize_bam_contig_depths --help

Now it should be executable from any directory.

4 Binning with MetaBAT2

Navigate to the file with your filtered coassembly contigs.fa file e.g. ‘filtered_contigs.fa’ and the filtered read depth text file from the mapping of reads e.g. ‘filtered_depth.txt’.

The following command runs the binning process. Change the -i argument to match your filtered assembly .fa file name, and the -a argument to match your filtered assembly contig coverage data. Change the -o argument if you want to change the output directory name and the prefixes of the bins. Change the -t argument to assign more or less CPU cores depending on your system. And finally, change the –minContig argument depending on your filtering strictness for MAG binning.

metabat2 \
  -i filtered_contigs.fa \
  -a filtered_depth.txt \
  -o bins_dir/bin \
  -t 16 \
  --minContig 2000

4.1 Running binning in the background

I recommend using nohup to run this in the background so that you can disconnect from the VM or computing system.

nohup metabat2 \
  -i filtered_contigs.fa \
  -a filtered_depth.txt \
  -o bins_dir/bin \
  -t 16 \
  --minContig 2000 \
  > binning.log 2>&1 &

You can check the process is running using:

ps aux | grep metabat2

You should see the command you just run to search for processes using the term metabat2, and then additional processes which is the actual metabat2 command running.

You can also check that the binning.log file is being populated with the metabat2 output.

tail -f binning.log

Once complete, the log should tell you on the last few lines how many ‘bins’ were created, which are your theoretical MAGs, and the percentage of contigs binned successfully along with the number of bp this represents.

Each of you bins should be in the output directory you specified and each in its own separate .fa file. There should also be a file called something like ‘bin.BinInfo.txt’ which can be opened to see how many contigs went into which bin, the total length in bp of all contigs in the bin, and the average coverage of the bin weighted by contig length.

5 Checking MAG quality

The next step is to assess the quality of the MAGs for completeness and contamination using CheckM (Parks et al. 2015).

Make sure you have any conda environments deactivated, you may still be in the metabat2 environment.

conda deactivate

Create a new environment for the CheckM install.

conda create -n checkm_env -c bioconda -c conda-forge checkm-genome

Activate the environment.

conda activate checkm_env

Now run the CheckM command, the -x argument tells the command which directory the bins.fa files are located in. The output is stored in a directory called checkm_output, 16 cores are assigned, and 8 threads are used for the phylogenetic assignment.

checkm lineage_wf -x fa bins_dir checkm_output -t 16 --pplacer_threads 8

This command works using a curated database of single-copy marker genes known to occur in specific phylogenetic lineages e.g. Bacteria. For each bin file, CheckM infers the likely taxonomic lineage, selects the most appropriate set of marker genes, and then searches for these marker genes in the contigs of that bin. The % of expected marker genes in the bin gives an idea of completeness, the % of marker genes found multiple times suggests a % of contamination in the bin.

The output is a filed called ‘lineage.ms’ which is not human readable but can be converted to a .tsv file.

checkm qa checkm_output/lineage.ms checkm_output/ -f checkm_output/checkm_summary.tsv --tab_table -o 2

This .tsv can be opened in excel or a text editor to investigate. It has the following columns:

  • BinID - the bin number
  • Marker lineage - Closest taxonomic lineage bases on single copy marker genes, this taxonomy is not very specific and is just used for marker selection that allows CheckM to estimate Completeness and Contamination
  • # genomes - Number of references genomes used for the lineage marker set
  • # markers - Total number of marker genes in the set used to test completeness
  • # marker sets - Redundancy groups of markers as multiple markers sometimes track the same taxon
  • Completeness - Estimated % completeness based on occurence of marker genes in the selected set
  • Contamination - Estimated % contamination from “single-copy” genes appearing multiple times in the bin
  • Strain heterogeneity - Degree of variation in those duplicate markers which suggests strain mixing in the bin
  • Genome size (bp) - Total base pairs in the bin
  • # ambiguous bases - Count of ambiguous bases (N) in the bin
  • # scaffolds - Number of scaffolds (contigs with possible gaps). If you have followed all my guides on coassembly, mapping, and binning then you will probably only have contigs. This means all the values in the scaffold columns will be the same as in the contig columns.
  • # contigs - Number of contigs in the bin
  • N50 (scaffolds) - Half the genome is in scaffolds at least this long
  • N50 (contigs) - Half the genome is in contigs at least this long
  • Mean scaffold length (bp) - Average scaffold length
  • Mean contig length (bp) - Average contig length
  • Longest scaffold (bp) - Length of the longest scaffold
  • Longest contig (bp) - Length of the longest contig
  • GC - GC content % of the genome
  • GC std (scaffolds > 1kbp) - Standard deviation in GC content across larger scaffolds, which can indicate contamination if high
  • Coding density - The % of the genome predicted to be coding sequence
  • Translation table - Genetic code used for gene prediction
  • # predicted genes - Total number of genes predicted in the bin
  • 0, 1, 2, 3, 4, 5+ - These final columns show the number of marker sets where there were 0, 1, 2, 3, 4, 5+ copies of a marker found in a single bin where higher numbers in the higher number columns suggests more contamination

6 Quality-filtering MAGs

Theoretically each bin should contain contigs from a single organism and should be a MAG, even if not very assembled. In reality, contigs from more than one organism may have been included in the same bin. And therefore we can use completeness and contamination to quality filter.

A minimum of 90 % completeness and no more than 5 % contamination likely represents thresholds for high-quality MAGs. Depending on your research goals, for example if you just want to search for certain functions, then you can likely lower the completeness. Some publications use 50 % for medium-quality MAGs. But increasing the contamination threshold might be risky.

We can use the summary .tsv file to get the numbers of the bins that meet our criteria and copy them to a new directory.

Make a new directory for high_quality_bins.

mkdir high_quality_bins

Then extract the bin numbers that meet our criteria and copy these bin.fa files to the new directory.

awk -F'\t' 'NR>1 && $6 >= 90 && $7 <= 5 {print $1}' checkm_output/checkm_summary.tsv | while read bin; do cp "bins_dir/${bin}.fa" high_quality_bins/; done

If you want you can count the number of bins in this new directory.

ls high_quality_bins/*.fa | wc -l

Your bins are now ready for your downstream analysis.

7 References

Kang, Dongwan D., Feng Li, Edward Kirton, Ashleigh Thomas, Rob Egan, Hong An, and Zhong Wang. 2019. MetaBAT 2: An Adaptive Binning Algorithm for Robust and Efficient Genome Reconstruction from Metagenome Assemblies.” PeerJ 7: e7359. doi:10.7717/peerj.7359.
Parks, Donovan H., Michael Imelfort, Connor T. Skennerton, Philip Hugenholtz, and Gene W. Tyson. 2015. CheckM: Assessing the Quality of Microbial Genomes Recovered from Isolates, Single Cells, and Metagenomes.” Genome Research 25 (7): 1043–55. doi:10.1101/gr.186072.114.