This guide will cover shotgun metagenomic read quality assessment followed by mapping read data back to contigs in shotgun metagenomic experiments. This is an important step because the read data helps us to contextualise the quality of the contigs. It will allow us to identify low coverage contigs that may be possible contaminants, and also contigs with ‘multi-modal’ coverage, a signature of a chimeric contig, i.e. a contig where DNA segments from different taxa have been incorrectly combined. Coverage information may also be important for downstream applications; for example, there are taxonomic classification tools that utilise coverage.
It will also allow us to assess the potential for improving the assembly or binning the contigs into metagenomes.
Finally, the read abundance gives us an idea of high abundance vs low abundance contigs (or genomes if binned), and this can give us an idea of community composition or even the abundance of functional gene sequences for a functional profile of the system.
A Linux-based system will be required for this guide and some high computing power is suggested; such a system might be available through your institution or can be purchased through an online cloud computing provider. Follow my guide on cloud based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html
A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.
Important note: it is assumed that the reads have already been quality trimmed to remove adapters and low-quality bases. This pre-processing is essential before mapping reads to contigs.
We will be using bowtie2 (Langmead and Salzberg 2012) to map read data back to the contigs. Samtools will then be required to index the alignment to a format where we can calculate coverage (Li et al. 2009).
For Ubuntu systems:
sudo apt update
sudo apt install bowtie2 samtools
Or for a conda environment:
conda install -c bioconda bowtie2 samtools
Move to a working directory where the ‘contigs.fa’ file and the ‘reads_R1.fastq.gz’ and ‘reads_R2.fastq.gz’ files are located. Your files will be named differently but should look similar.
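Before committing to a long mapping run, it can be worth a quick sanity check that the paired read files are intact: R1 and R2 should contain the same number of records (one FASTQ record is four lines). A minimal sketch, using tiny demo files in place of your real reads (all file names here are illustrative):

```shell
# Count reads in a (possibly gzipped) FASTQ file.
# One FASTQ record is four lines, so read count = line count / 4.
count_reads() {
    local n
    if [[ "$1" == *.gz ]]; then
        n=$(zcat "$1" | wc -l)
    else
        n=$(wc -l < "$1")
    fi
    echo $((n / 4))
}

# Tiny demo files stand in for your real reads_R1/R2.fastq.gz here.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > demo_R1.fastq
printf '@r1\nTGCA\n+\nIIII\n@r2\nAAAA\n+\nIIII\n' > demo_R2.fastq

r1=$(count_reads demo_R1.fastq)
r2=$(count_reads demo_R2.fastq)
echo "R1: $r1 reads, R2: $r2 reads"
if [ "$r1" -ne "$r2" ]; then echo "WARNING: read counts differ"; fi
```

For real data, point count_reads at your fastq.gz files; a mismatch usually means a truncated download or trimming that broke the pairing.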
Firstly, we will build index files for the contigs (a set of files that allow bowtie2 to search the contig sequences efficiently) with the following command:
bowtie2-build contigs.fa contigs_index
This will make files like ‘contigs_index.1.bt2’, ‘contigs_index.rev.2.bt2’ etc.
Next we are ready to map the forward and reverse read files to the index generated from the contigs. The ‘-p’ argument sets the number of threads assigned to the job; more threads will, in theory, run faster, but the number available will depend on your machine. The ‘-S’ argument sets the path of the output .sam file. This step may take some time.
bowtie2 -x contigs_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -S contig_aligned.sam --very-sensitive -p 8
Now we can convert and sort the SAM to a BAM file. Here we take the plain text .sam file containing all the alignments of reads to the contigs and convert it to a .bam binary format which is compressed and much smaller.
samtools view -bS contig_aligned.sam > contig_aligned.bam
Then we sort the .bam file so that the alignments are ordered by contig name and genomic position.
samtools sort contig_aligned.bam -o contig_aligned.sorted.bam
And finally index the sorted .bam to create a .bai file that allows access to specific regions of the .bam. This step is required to calculate per contig coverage.
samtools index contig_aligned.sorted.bam
It is likely that you have multiple samples and want to repeat the process above multiple times. Read my guide on coassembly for an explanation of the distinction between the single assembly and coassembly approaches (https://scottc-bio.github.io/guides/Shotgun-metagenomic-coassembly.html).
If you have chosen a single assembly approach you will have multiple samples, each with their own contig file such as ‘sample1.contigs.fa’, ‘sample2.contigs.fa’, etc., and each with their own corresponding pair of forward and reverse fastq.gz read files e.g. ‘sample1_R1.fastq.gz’, ‘sample1_R2.fastq.gz’, ‘sample2_R1.fastq.gz’, ‘sample2_R2.fastq.gz’, etc. In this case it is possible to create a bash script which will loop through each contig file and its corresponding read files and complete the process described above.
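Before launching the full pipeline, a dry run that only checks the file pairing can save a failed overnight job. A minimal sketch using empty demo files (a demo directory stands in for your working directory, and sample2 is deliberately missing its R2 file):

```shell
# Dry run: check that every .contigs.fa has matching R1/R2 read files
# before starting the real (slow) mapping loop.
mkdir -p pairing_demo
touch pairing_demo/sample1.contigs.fa \
      pairing_demo/sample1_R1.fastq.gz pairing_demo/sample1_R2.fastq.gz
touch pairing_demo/sample2.contigs.fa pairing_demo/sample2_R1.fastq.gz

for contig_file in pairing_demo/*.contigs.fa; do
    sample_name=$(basename "$contig_file" .contigs.fa)
    dir=$(dirname "$contig_file")
    if [ -f "$dir/${sample_name}_R1.fastq.gz" ] && [ -f "$dir/${sample_name}_R2.fastq.gz" ]; then
        echo "OK: $sample_name"
    else
        echo "MISSING reads for: $sample_name"
    fi
done > pairing_report.txt
cat pairing_report.txt
```

Run the same loop over your real directory (drop the pairing_demo/ prefix) and only start the mapping script once every sample reports OK.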
Create a text file containing the following code and save as something like “map_and_sort.sh”:
#!/bin/bash

# Create output folder
mkdir -p coverage

# Loop through all .contigs.fa files
for contig_file in *.contigs.fa; do
    # Get the sample name (e.g., sample1 from sample1.contigs.fa)
    sample_name=$(basename "$contig_file" .contigs.fa)
    echo "Processing $sample_name..."

    # Build index
    bowtie2-build "$contig_file" "${sample_name}_index"

    # Map reads
    bowtie2 -x "${sample_name}_index" \
        -1 "${sample_name}_R1.fastq.gz" \
        -2 "${sample_name}_R2.fastq.gz" \
        -S "coverage/${sample_name}_aligned.sam" \
        --very-sensitive -p 8

    # Convert to sorted BAM
    samtools view -bS "coverage/${sample_name}_aligned.sam" | \
        samtools sort -o "coverage/${sample_name}_aligned.sorted.bam"

    # Index the BAM file
    samtools index "coverage/${sample_name}_aligned.sorted.bam"

    # Optional: remove large SAM file to save space
    rm "coverage/${sample_name}_aligned.sam"

    echo "Finished $sample_name"
done
If you have written this text file in a Windows word processor, then I recommend opening it in Notepad++, where in the bottom right you can change “Windows (CR LF)” to “Unix (LF)”. This makes the file readable on a Linux based system.
I also recommend the freely available WinSCP software, which allows you to manage files on your VM with a GUI, meaning you can drag and drop files from a local PC to your VM, including bash scripts like the one created above. If you create the .sh file using the WinSCP text editor, then it should already be in Unix format.
If you make the .sh file in a text editor and move it to the VM without changing to Unix format, then you can instead use the ‘dos2unix’ package to convert it:
sudo apt install dos2unix
dos2unix map_and_sort.sh
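If dos2unix is unavailable, you can also detect and strip carriage returns with grep and sed. A small sketch, creating a demo script with CRLF endings on purpose:

```shell
# Detect Windows (CRLF) line endings and strip them with sed.
# A demo script with CRLF endings is created here on purpose.
printf '#!/bin/bash\r\necho hello\r\n' > demo_crlf.sh

CR=$(printf '\r')
if grep -q "$CR" demo_crlf.sh; then
    echo "CRLF endings found - converting"
    sed -i 's/\r$//' demo_crlf.sh
fi

if ! grep -q "$CR" demo_crlf.sh; then
    echo "File is now in Unix (LF) format"
fi
```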
If you have chosen a coassembly approach then you likely have a single coassembly contig file e.g. ‘contigs.fa’, and for each sample you have your forward and reverse reads files e.g. ‘sample1_R1.fastq.gz’, ‘sample1_R2.fastq.gz’, ‘sample2_R1.fastq.gz’, ‘sample2_R2.fastq.gz’. You can create a bash script that will loop through each set of read files, index them, map them to the single coassembly contig file, and then convert and sort into the .bam and .bai files.
You can repeat the bowtie2 indexing of the single coassembly contig file as done before, this time with the prefix used by the script below (i.e. bowtie2-build contigs.fa contig_index), which should create several .bt2 files with the prefix ‘contig_index’.
Create a .sh file called ‘map_and_sort.sh’.
#!/bin/bash

# Set index prefix
INDEX="contig_index"

# Loop through all R1 read files
for R1 in *_R1.fastq.gz; do
    # Extract sample name (strip the _R1.fastq.gz suffix)
    SAMPLE="${R1%_R1.fastq.gz}"
    R2="${SAMPLE}_R2.fastq.gz"

    # Check if matching R2 file exists
    if [[ ! -f "$R2" ]]; then
        echo "Missing R2 for sample: $SAMPLE, skipping"
        continue
    fi

    echo "Processing sample: $SAMPLE"

    # Bowtie2 mapping
    bowtie2 -x "$INDEX" -1 "$R1" -2 "$R2" -S "${SAMPLE}.sam" -p 16

    # Convert to BAM, sort, index
    samtools view -bS "${SAMPLE}.sam" > "${SAMPLE}.bam"
    samtools sort "${SAMPLE}.bam" -o "${SAMPLE}.sorted.bam"
    samtools index "${SAMPLE}.sorted.bam"

    # Clean up intermediate files
    rm "${SAMPLE}.sam"
done
Once the bash script is in Unix format and in your directory alongside your contig and read files (you can modify the script if the contig and read files are located elsewhere), make the script executable.
chmod +x map_and_sort.sh
And finally execute.
./map_and_sort.sh
If you want to run in the background, then a good option is to use nohup and store the output in another file.
nohup ./map_and_sort.sh > map_and_sort.log 2>&1 &
Now you can disconnect from the VM and the process will run in the background. If you want to check the progress you can re-connect and then check the end of the output log.
tail -f map_and_sort.log
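Each bowtie2 run prints a summary to stderr ending in an ‘overall alignment rate’ line, which the 2>&1 redirect above captures in the log, so you can pull out the per-sample rates with grep. A mock log stands in for a real one here:

```shell
# bowtie2 prints a summary for each run ending in "overall alignment rate";
# grep pulls one such line per sample out of the combined log.
# A mock log file stands in for the real map_and_sort.log here.
cat > demo_map.log <<'EOF'
Processing sample: sample1
98.12% overall alignment rate
Processing sample: sample2
91.47% overall alignment rate
EOF

grep "overall alignment rate" demo_map.log
```

On the real log, run grep against map_and_sort.log; a sample with a much lower rate than the others is worth investigating.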
If you have made a coassembly, then you may wonder why you are mapping reads from each sample individually against that coassembly. The answer is that you want to know both the average depth across samples and the variation in coverage between samples, as these two measures are important for downstream applications.
Based on the mapping, we can now summarise the contig coverage per sample to provide important information that we can use for filtering.
We will be using MetaBAT2 (Kang et al. 2019) for this.
The simplest way to install MetaBAT2 is through a conda environment, assuming you have conda on your system.
Miniconda is required for many bioinformatic processes and is useful to have; install it as follows:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Enter ‘yes’ to both questions.
For the changes to take effect, you will have to log out of the Linux system or VM, and then log back in or re-connect.
Set up the necessary channels for your conda. The order of these lines matters.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Create a new environment.
conda create -n metabat2_env metabat2
Activate the environment.
conda activate metabat2_env
Test the installation.
metabat2 --version
jgi_summarize_bam_contig_depths --help
Alternatively, if you would rather not use conda, you can download the MetaBAT2 source archive.
wget https://bitbucket.org/berkeleylab/metabat/get/v2.15.tar.gz
Extract the archive.
tar -xvzf v2.15.tar.gz
Now, to make the MetaBAT2 binaries accessible from any directory, you need to know the full path to the /bin directory of your MetaBAT2 installation and add it to your path.
echo 'export PATH="/full/path/to/metabat2/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Now it should be executable from any directory.
The ‘jgi_summarize_bam_contig_depths’ command will extract the read depth information from all .sorted.bam files. If you took a single assembly approach, then this will have to be repeated for each sample’s reads and corresponding .sorted.bam assembly file.
jgi_summarize_bam_contig_depths --outputDepth depth.txt *.sorted.bam
The console output of this command reports, for each sample, the percentage of reads successfully mapped to the coassembly (or single assembly), which is worth noting.
We can also investigate the top lines of the depth.txt file to confirm its structure is correct.
head depth.txt
The columns should be contigName, contigLen, totalAvgDepth, and then, for each sample (assuming a coassembly), a column for that sample’s average coverage of the contig followed by a column for that sample’s coverage variance.
We can use this file to filter based on totalAvgDepth and contigLen. If you want to assign taxonomy to the contigs, then filtering parameters of > 1000 bp and a minimum of 2x average coverage may be appropriate. If you want to bin contigs into metagenomes, then stricter parameters are suggested.
Firstly, we can use a python script to filter based on contig length and total average coverage. Save the following code to a .py file called something like ‘filter_depths.py’.
The contigLen and totalAvgDepth parameters can be easily changed depending on your needs.
import pandas as pd
import os
# Create output directory
outdir = "filtered"
os.makedirs(outdir, exist_ok=True)
# Load depth file
df = pd.read_csv("depth.txt", sep="\t")
# Filter for taxonomic assignment
tax_filter = df[(df["contigLen"] >= 1000) & (df["totalAvgDepth"] >= 2)]
tax_file = os.path.join(outdir, "filtered_contigs.txt")
tax_filter.to_csv(tax_file, sep="\t", index=False)
# Summary
print("Total contigs in depth.txt:", len(df))
print("Contigs retained for taxonomic assignment:", len(tax_filter))
Execute this with python3.
python3 filter_depths.py
This will also print the number of contigs before filtering and the number retained after filtering.
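If you prefer to stay on the command line, the same length and coverage filter can be expressed as an awk one-liner, assuming the standard depth.txt column order (contigName, contigLen, totalAvgDepth, then the per-sample columns). A sketch with a mock depth file standing in for the real depth.txt:

```shell
# Keep the header line plus any contig with length >= 1000 bp and
# average depth >= 2x. A mock depth file stands in for the real depth.txt.
printf 'contigName\tcontigLen\ttotalAvgDepth\tsample1.bam\tsample1.bam-var\n' >  demo_depth.txt
printf 'k141_1\t1500\t5.2\t5.2\t0.4\n'                                        >> demo_depth.txt
printf 'k141_2\t800\t9.9\t9.9\t1.1\n'                                         >> demo_depth.txt
printf 'k141_3\t2500\t1.1\t1.1\t0.2\n'                                        >> demo_depth.txt

# Column 2 is contigLen, column 3 is totalAvgDepth; NR == 1 keeps the header.
awk -F'\t' 'NR == 1 || ($2 >= 1000 && $3 >= 2)' demo_depth.txt > demo_filtered.txt
cat demo_filtered.txt
```

Only k141_1 survives this demo filter: k141_2 is too short and k141_3 is too shallow. On real data, run the awk command on depth.txt instead.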
But this is just a filtered text file; we want a fasta file of these filtered contigs, which we can make using this text file together with the .fa file of the final contigs from the megahit coassembly.
We can do this using seqtk, so install it:
conda install -c bioconda seqtk
Firstly, let’s extract the contig IDs from the filtered depth file (note that the python script wrote it to the ‘filtered’ directory).
tail -n +2 filtered/filtered_contigs.txt | cut -f1 > filtered_ids.txt
Inspect the top of this text file to make sure it is just a list of contig IDs.
head filtered_ids.txt
Then run the command to extract the fasta sequences to a new .fa file based on contig names from the filtered depth .txt file.
seqtk subseq contigs.fa filtered_ids.txt > filtered_contigs.fa
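As a final consistency check, the number of sequences in the new fasta should equal the number of IDs you supplied. A sketch with mock files standing in for the real outputs:

```shell
# The number of '>' headers in the output fasta should match the number of
# IDs in the list. Mock files stand in for the real outputs here.
printf '>k141_1\nACGT\n>k141_3\nTTTT\n' > demo_filtered.fa
printf 'k141_1\nk141_3\n' > demo_ids.txt

n_ids=$(wc -l < demo_ids.txt)
n_fa=$(grep -c '^>' demo_filtered.fa)
echo "IDs: $n_ids, sequences extracted: $n_fa"
if [ "$n_fa" -ne "$n_ids" ]; then echo "WARNING: counts differ"; fi
```

On real data, compare filtered_ids.txt against filtered_contigs.fa; a shortfall usually means some IDs in the list were not present in contigs.fa.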
Now you have filtered the contigs and you are ready to proceed with the next steps of your analysis.