This guide will cover the code needed to bin contigs into genome bins, and hopefully metagenome-assembled genomes (MAGs).
If you have not already, you should quality control your reads to ensure no adapter content and no low-quality sequences. Follow my guide here, which is aimed at amplicon metabarcoding but the same quality control applies to shotgun metagenomic reads: https://scottc-bio.github.io/guides/Metabarcoding-quality-control-with-FastQC.html. After this, the reads need to have been assembled into a coassembly, or multiple single assemblies, which can be performed following my guide: https://scottc-bio.github.io/guides/Shotgun-metagenomic-coassembly.html. Then the reads from each sample need to be mapped back to the coassembly to allow filtering of contigs with insufficient coverage, which can be performed using the following guide: https://scottc-bio.github.io/guides/Shotgun-metagenomic-read-mapping.html.
Assuming that you have performed all these steps, you are now ready to bin your contigs. Multiple tools are available for this; I will be using MetaBAT2 (Kang et al. 2019).
When you assembled your contigs, reads with overlapping stretches of sequence were pieced together like a jigsaw into longer contigs, without any reference to taxonomy. Mapping reads from each sample back to the coassembly generated coverage data for each contig across samples. When we bin contigs, these patterns of coverage across samples are used to suggest a relationship between contigs, i.e. contigs whose coverage rises and falls in the same samples likely come from the same organism. Binning also uses composition signals such as the GC content of the contigs and tetranucleotide frequency, and possibly other k-mer parameters depending on your binning program.
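To make the composition idea concrete, here is a minimal sketch of calculating GC content per contig with awk. This is illustrative only, run on a made-up two-contig FASTA; the binner computes signals like this internally over your real contigs.

```shell
# Toy FASTA standing in for a real contigs file (illustrative only).
printf '>contig_1\nGGCCGGCC\n>contig_2\nATATATAT\n' > toy_contigs.fa

# Print "contig<TAB>GC%" for each FASTA record.
awk '/^>/ {if (name) printf "%s\t%.2f\n", name, 100*gc/len;
           name=substr($1,2); gc=0; len=0; next}
     {seq=toupper($0); len+=length(seq); gc+=gsub(/[GC]/, "", seq)}
     END {if (name) printf "%s\t%.2f\n", name, 100*gc/len}' toy_contigs.fa
```

Binners combine composition signals like this with the per-sample coverage profiles to cluster contigs into bins.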
The simplest way to use MetaBAT2 is through a conda environment, assuming you have conda on your system.
Miniconda is useful to have, as many bioinformatic tools are installed this way.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Enter ‘yes’ to both questions.
For the changes to register, you will have to log out of the LINUX system or VM and then log back in (or restart your shell).
Set up the necessary channels for conda. The order of these lines matters.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Create a new environment.
conda create -n metabat2_env metabat2
Activate the environment.
conda activate metabat2_env
Test the installation.
metabat2 --version
jgi_summarize_bam_contig_depths --help
MetaBAT2 should now be executable from any directory while the environment is active.
Navigate to the directory containing your filtered coassembly contigs .fa file e.g. ‘filtered_contigs.fa’ and the filtered read-depth text file from the mapping of reads e.g. ‘filtered_depth.txt’.
The following command runs the binning process. Change the -i argument to match your filtered assembly .fa file name, and the -a argument to match your filtered assembly contig coverage data. Change the -o argument if you want to change the output directory name and the prefixes of the bins. Change the -t argument to assign more or fewer CPU cores depending on your system. And finally, change the --minContig argument depending on how strictly you want to exclude short contigs from binning.
metabat2 \
-i filtered_contigs.fa \
-a filtered_depth.txt \
-o bins_dir/bin \
-t 16 \
--minContig 2000
I recommend using nohup to run this in the background so that you can disconnect from the VM or computing system.
nohup metabat2 \
-i filtered_contigs.fa \
-a filtered_depth.txt \
-o bins_dir/bin \
-t 16 \
--minContig 2000 \
> binning.log 2>&1 &
You can check the process is running using:
ps aux | grep metabat2
You should see the grep command you just ran to search for processes matching ‘metabat2’, plus an additional process, which is the actual metabat2 command running.
You can also check that the binning.log file is being populated with the metabat2 output.
tail -f binning.log
Once complete, the last few lines of the log should report how many ‘bins’ were created, which are your candidate MAGs, along with the percentage of contigs binned successfully and the number of bp this represents.
Each of your bins should be in the output directory you specified, each in its own separate .fa file. There should also be a file called something like ‘bin.BinInfo.txt’, which can be opened to see how many contigs went into each bin, the total length in bp of all contigs in the bin, and the average coverage of the bin weighted by contig length.
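If you want a quick per-bin contig count straight from the .fa files, counting FASTA header lines works. Sketched here on a made-up toy_bins directory; substitute the output directory you gave MetaBAT2, e.g. bins_dir.

```shell
# Toy bin directory standing in for the real MetaBAT2 output directory.
mkdir -p toy_bins
printf '>c1\nACGT\n>c2\nACGT\n' > toy_bins/bin.1.fa
printf '>c3\nACGT\n' > toy_bins/bin.2.fa

# Each '>' line is one contig, so counting headers counts contigs.
for f in toy_bins/*.fa; do
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
done
```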
The next step is to assess the quality of the MAGs for completeness and contamination using CheckM (Parks et al. 2015).
Make sure any conda environments are deactivated; you may still be in the metabat2 environment.
conda deactivate
Create a new environment for the CheckM install.
conda create -n checkm_env -c bioconda -c conda-forge checkm-genome
Activate the environment.
conda activate checkm_env
Now run the CheckM command. The -x argument tells CheckM the file extension of the bin files (here .fa); the two positional arguments are the directory containing the bins and the output directory. 16 cores are assigned, and 8 threads are used for the phylogenetic placement step.
checkm lineage_wf -x fa bins_dir checkm_output -t 16 --pplacer_threads 8
This command works using a curated database of single-copy marker genes known to occur in specific phylogenetic lineages e.g. Bacteria. For each bin file, CheckM infers the likely taxonomic lineage, selects the most appropriate set of marker genes, and then searches for these marker genes in the contigs of that bin. The percentage of expected marker genes found in the bin gives an estimate of completeness, while the percentage of marker genes found multiple times suggests the level of contamination in the bin.
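As a simplified illustration of the logic (a sketch with hypothetical marker counts; CheckM's real calculation works on collocated marker sets, not raw marker tallies):

```shell
# Hypothetical marker counts for one bin (illustrative numbers only).
expected=100    # single-copy markers expected for the inferred lineage
found=92        # distinct markers detected in the bin
duplicated=4    # markers detected more than once

echo "completeness: $(( 100 * found / expected ))%"
echo "contamination: $(( 100 * duplicated / expected ))%"
```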
The output is a file called ‘lineage.ms’, which is not human readable but can be converted to a .tsv file.
checkm qa checkm_output/lineage.ms checkm_output/ -f checkm_output/checkm_summary.tsv --tab_table -o 2
This .tsv can be opened in Excel or a text editor to investigate. Among its columns are the bin ID (column 1), the inferred marker lineage, and, most importantly, completeness (column 6) and contamination (column 7); check the header row of your own file to confirm the layout.
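To sanity-check which fields the filtering command below will read, you can pull out the bin ID, completeness, and contamination columns with awk. This is shown on a made-up two-line toy file; the real file is checkm_output/checkm_summary.tsv, and the column positions assume the usual -o 2 layout, so confirm them against your own header row.

```shell
# Toy stand-in for checkm_output/checkm_summary.tsv, truncated to the
# first seven columns of the usual -o 2 layout (assumed, not verified here).
printf 'Bin Id\tMarker lineage\t# genomes\t# markers\t# marker sets\tCompleteness\tContamination\n' > toy_summary.tsv
printf 'bin.1\tk__Bacteria\t10\t104\t58\t95.20\t1.30\n' >> toy_summary.tsv

# Columns 1, 6, and 7: bin ID, completeness, contamination.
awk -F'\t' 'NR>1 {print $1, $6, $7}' toy_summary.tsv
```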
Theoretically, each bin should contain contigs from a single organism and should therefore be a MAG, even if a fragmented one. In reality, contigs from more than one organism may have been included in the same bin, so we can use completeness and contamination to quality filter.
A minimum of 90 % completeness and no more than 5 % contamination are commonly used thresholds for high-quality MAGs. Depending on your research goals, for example if you just want to search for certain functions, you can likely lower the completeness threshold; some publications use 50 % for medium-quality MAGs. Raising the contamination threshold, however, is riskier.
We can use the summary .tsv file to get the IDs of the bins that meet our criteria and copy them to a new directory.
Make a new directory for high_quality_bins.
mkdir high_quality_bins
Then extract the bin IDs that meet our criteria and copy those bin .fa files to the new directory.
awk -F'\t' 'NR>1 && $6 >= 90 && $7 <= 5 {print $1}' checkm_output/checkm_summary.tsv | while read bin; do cp "bins_dir/${bin}.fa" high_quality_bins/; done
If you want, you can count the number of bins in this new directory.
ls high_quality_bins/*.fa | wc -l
Your bins are now ready for your downstream analysis.