This guide covers mapping read data back to contigs in shotgun metagenomic experiments and summarising the resulting coverage. This is an important step because the read data helps us to contextualise the quality of the contigs: it allows us to identify low-coverage contigs that may be possible contaminants, and also contigs with ‘multi-modal’ coverage, a signifier of a chimeric contig, i.e. a contig where DNA segments from different taxa have been incorrectly combined. Coverage information may also be important for downstream applications; for example, some taxonomic classification tools utilise coverage.
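To illustrate what ‘multi-modal’ coverage can look like, here is a toy Python sketch. It is not part of the pipeline, and the window size and ratio threshold are arbitrary choices for illustration only; it simply flags a contig whose windowed coverage splits into two well-separated levels.

```python
def window_means(depths, window=100):
    """Mean coverage in fixed-size windows along a contig."""
    return [sum(depths[i:i + window]) / len(depths[i:i + window])
            for i in range(0, len(depths), window)]

def looks_multimodal(depths, window=100, ratio=3.0):
    """Crude heuristic: flag the contig if the highest window mean is
    much larger than the lowest, suggesting two coverage levels."""
    means = window_means(depths, window)
    lo, hi = min(means), max(means)
    return lo > 0 and hi / lo >= ratio

# A contig whose first half has ~10x and second half ~40x coverage
depths = [10] * 500 + [40] * 500
print(looks_multimodal(depths))  # two distinct levels -> True
```

Real chimera detection is more involved, but the intuition is the same: one genuine contig should have roughly one coverage level along its length.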

It will also allow us to assess the potential for improving the assembly or binning the contigs into metagenomes.

Finally, the read abundance gives us an idea of high abundance vs low abundance contigs (or genomes if binned), and this can give us an idea of community composition or even the abundance of functional gene sequences for a functional profile of the system.
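As a rough sketch of how coverage translates to relative abundance: for each contig (or bin), average depth multiplied by length is roughly proportional to the number of mapped bases, so normalising those products gives a simple compositional profile. A minimal Python illustration with invented contig names and numbers:

```python
# (contig, length_bp, mean_depth) - illustrative values only
contigs = [("contig_1", 50_000, 20.0),
           ("contig_2", 10_000, 5.0),
           ("contig_3", 5_000, 100.0)]

# mapped bases are approximately length * mean depth
mapped = {name: length * depth for name, length, depth in contigs}
total = sum(mapped.values())
rel_abundance = {name: bases / total for name, bases in mapped.items()}

for name, frac in rel_abundance.items():
    print(f"{name}: {frac:.1%}")
```

Note that a short contig with very high depth (contig_3 here) can still contribute a large fraction of the mapped bases, which is why both length and depth matter.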

A Linux-based system is required for this guide, and reasonably high computing power is suggested; such a system may exist at your institution or can be rented from an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html

A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.

Important note: it is assumed that the reads have already been quality trimmed to remove adapters and low-quality bases. This pre-processing is very important before mapping reads to contigs.
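To make clear what that pre-processing does, here is a toy 3'-end quality trim in Python. This is purely illustrative; for real data use a dedicated tool (e.g. fastp or Trimmomatic), which implement far more sophisticated algorithms.

```python
def trim_3prime(seq, quals, min_q=20):
    """Trim low-quality bases from the 3' end of a read.
    quals are Phred scores; cut bases from the end while their
    quality is below min_q (a deliberately simplistic rule)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

seq = "ACGTACGTAC"
quals = [35, 36, 34, 33, 30, 28, 25, 15, 10, 8]
trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq)  # "ACGTACG" - the three low-quality 3' bases are dropped
```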

1 Installing required packages

We will be using bowtie2 (Langmead and Salzberg 2012) to map read data back to the contigs. Samtools will then be required to index the alignment to a format where we can calculate coverage (Li et al. 2009).

For Ubuntu systems:

sudo apt update
sudo apt install bowtie2 samtools

Or for a conda environment:

conda install -c bioconda bowtie2 samtools

2 Building a bowtie index

Move to a working directory where the ‘contigs.fa’ file and the ‘reads_R1.fastq.gz’ and ‘reads_R2.fastq.gz’ files are located. Your files will be named differently but should look similar.

Firstly, we will build index files for the contigs (essentially a compressed, searchable representation of the contig sequences) with the following command:

bowtie2-build contigs.fa contigs_index

This will make files like ‘contigs_index.1.bt2’, ‘contigs_index.rev.2.bt2’ etc.

3 Mapping reads

Next we are ready to map the forward and reverse read files to the index built from the contigs. The ‘-p’ argument sets the number of threads assigned to the job; more threads should in theory run faster, but the number available will depend on your machine. The ‘-S’ argument sets the path of the output .sam file; note that its destination directory (here ‘coverage’) must already exist, so we create it first. This step may take some time.

mkdir -p coverage
bowtie2 -x contigs_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -S coverage/contigs_aligned.sam --very-sensitive -p 8

4 Converting SAM to BAM

Now we can convert and sort the SAM to a BAM file. Here we take the plain text .sam file containing all the alignments of reads to the contigs and convert it to a .bam binary format which is compressed and much smaller.

samtools view -bS coverage/contigs_aligned.sam > coverage/contigs_aligned.bam
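Since the .sam file is plain text, it can help to know what each alignment line contains. The sketch below parses a single, made-up SAM record into its mandatory fields; field 3 (RNAME) is the contig the read mapped to, which is what the coverage calculation later relies on.

```python
# A made-up SAM alignment line (tab-separated mandatory fields)
sam_line = "read_001\t99\tcontig_42\t1500\t42\t100M\t=\t1650\t250\tACGT\tFFFF"

fields = sam_line.split("\t")
record = {
    "qname": fields[0],        # read name
    "flag": int(fields[1]),    # bitwise flags (paired, strand, ...)
    "rname": fields[2],        # reference (contig) name
    "pos": int(fields[3]),     # 1-based leftmost mapping position
    "mapq": int(fields[4]),    # mapping quality
    "cigar": fields[5],        # alignment description
}
print(record["rname"], record["pos"])
```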

Then we sort the .bam file so that the alignments are ordered by contig name and position.

samtools sort coverage/contigs_aligned.bam -o coverage/contigs_aligned.sorted.bam

And finally index the sorted .bam to create a .bai file that allows fast access to specific regions of the .bam. This step is required to calculate per-contig coverage.

samtools index coverage/contigs_aligned.sorted.bam

It is likely that you have multiple samples and want to repeat the process above multiple times. Read my guide on coassembly for an explanation of the distinction between the single assembly and coassembly approaches (https://scottc-bio.github.io/guides/Shotgun-metagenomic-coassembly.html).

5 Multiple single assembly automation

If you have chosen a single assembly approach you will have multiple samples, each with their own contig file such as ‘sample1.contigs.fa’, ‘sample2.contigs.fa’, etc., and each with their own corresponding pair of forward and reverse fastq.gz read files, e.g. ‘sample1_R1.fastq.gz’, ‘sample1_R2.fastq.gz’, ‘sample2_R1.fastq.gz’, ‘sample2_R2.fastq.gz’, etc. In this case it is possible to create a bash script which will loop through each contig file and its corresponding read files and complete the process described above.
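The key piece of logic in the loop below is deriving the sample name from each contig filename so the matching read files can be found. As a quick Python illustration of that naming convention (the filenames here are hypothetical):

```python
files = ["sample1.contigs.fa", "sample2.contigs.fa"]

for contig_file in files:
    # strip the .contigs.fa suffix, mirroring basename "$f" .contigs.fa
    sample = contig_file[:-len(".contigs.fa")]
    r1 = f"{sample}_R1.fastq.gz"
    r2 = f"{sample}_R2.fastq.gz"
    print(sample, r1, r2)
```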

Create a text file containing the following code and save as something like “map_and_sort.sh”:

#!/bin/bash

# Create output folder
mkdir -p coverage

# Loop through all .contigs.fa files
for contig_file in *.contigs.fa; do
    # Get the sample name (e.g., sample1 from sample1.contigs.fa)
    sample_name=$(basename "$contig_file" .contigs.fa)

    echo "Processing $sample_name..."

    # Build index
    bowtie2-build "$contig_file" "${sample_name}_index"

    # Map reads
    bowtie2 -x "${sample_name}_index" \
        -1 "${sample_name}_R1.fastq.gz" \
        -2 "${sample_name}_R2.fastq.gz" \
        -S "coverage/${sample_name}_aligned.sam" \
        --very-sensitive -p 8

    # Convert to sorted BAM
    samtools view -bS "coverage/${sample_name}_aligned.sam" | \
        samtools sort -o "coverage/${sample_name}_aligned.sorted.bam"

    # Index the BAM file
    samtools index "coverage/${sample_name}_aligned.sorted.bam"

    # Optional: remove large SAM file to save space
    rm "coverage/${sample_name}_aligned.sam"

    echo "Finished $sample_name"
done

5.1 A note on VMs and making bash scripts

If you have made this text file in a Windows-based word processor, then I recommend opening it in Notepad++, because in the bottom right you can change “Windows (CR LF)” to “Unix (LF)”. This line-ending conversion makes the file understandable for a Linux-based system.

I also recommend using the freely available WinSCP software, which allows you to manage files on your VM with a GUI, meaning you can drag and drop files from a local PC to your VM, including bash scripts like the one created above. If you create the .sh file using the WinSCP text editor, then it should already be in Unix format.

If you make the .sh file in a text editor and move it to the VM without converting to Unix format, then you can instead use the ‘dos2unix’ package to convert it:

sudo apt install dos2unix
dos2unix map_and_sort.sh

6 Multiple coassembly automation

If you have chosen a coassembly approach then you likely have a single coassembly contig file e.g. ‘contigs.fa’, and for each sample you have your forward and reverse reads files e.g. ‘sample1_R1.fastq.gz’, ‘sample1_R2.fastq.gz’, ‘sample2_R1.fastq.gz’, ‘sample2_R2.fastq.gz’. You can create a bash script that will loop through each set of read files, index them, map them to the single coassembly contig file, and then convert and sort into the .bam and .bai files.
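The pairing logic the script below uses (find each _R1 file, derive its _R2 mate, and skip any sample whose mate is missing) can be sketched in Python as follows; the file list here is hypothetical:

```python
files = {"sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
         "sample2_R1.fastq.gz"}  # sample2's R2 is missing

pairs, skipped = [], []
for f in sorted(files):
    if not f.endswith("_R1.fastq.gz"):
        continue  # only iterate over forward-read files
    sample = f[:-len("_R1.fastq.gz")]
    r2 = f"{sample}_R2.fastq.gz"
    # keep the sample only if its reverse-read mate exists
    (pairs if r2 in files else skipped).append(sample)

print(pairs)    # ['sample1']
print(skipped)  # ['sample2']
```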

You can repeat the bowtie2 indexing of the single coassembly contig file as in section 2, but using the prefix ‘contig_index’ (i.e. ‘bowtie2-build contigs.fa contig_index’), which should create some .bt2 files with that prefix.

Create a .sh file called ‘map_and_sort.sh’.

#!/bin/bash

# Set index prefix
INDEX="contig_index"

# Loop through all R1 read files
for R1 in *_R1.fastq.gz; do
    # Extract sample name (remove _R1.fastq.gz)
    SAMPLE=$(echo "$R1" | sed 's/_R1.fastq.gz//')
    R2="${SAMPLE}_R2.fastq.gz"

    # Check if matching R2 file exists
    if [[ ! -f "$R2" ]]; then
        echo "Missing R2 for sample: $SAMPLE — skipping"
        continue
    fi

    echo "Processing sample: $SAMPLE"

    # Bowtie2 mapping
    bowtie2 -x "$INDEX" -1 "$R1" -2 "$R2" -S "${SAMPLE}.sam" -p 16

    # Convert to BAM, sort, index
    samtools view -bS "${SAMPLE}.sam" > "${SAMPLE}.bam"
    samtools sort "${SAMPLE}.bam" -o "${SAMPLE}.sorted.bam"
    samtools index "${SAMPLE}.sorted.bam"

    # Clean up intermediate files
    rm "${SAMPLE}.sam"
done

7 Running the script for multiple samples

When the bash script is in Unix format and in your directory alongside your contig and read files (you can modify the script to point at other locations if your files are elsewhere), make the script executable.

chmod +x map_and_sort.sh

And finally execute.

./map_and_sort.sh

If you want to run in the background, then a good option is to use nohup and store the output in another file.

nohup ./map_and_sort.sh > map_and_sort.log 2>&1 &

Now you can disconnect from the VM and the process will run in the background. If you want to check the progress you can re-connect and then check the end of the output log.

tail -f map_and_sort.log

7.1 Why map each sample individually?

If you have made a coassembly, then you may wonder why you are mapping reads individually for each sample to that coassembly. The answer is that you want to know average depth across samples and variation in coverage between samples, as these two measures are important for downstream applications.
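In other words, per-sample mapping lets you compute, for each contig, both the mean depth across samples and how much that depth varies between samples. A minimal sketch with made-up numbers:

```python
from statistics import mean, pvariance

# per-sample mean depth for one contig (illustrative values)
depth_by_sample = {"sample1": 12.0, "sample2": 30.0, "sample3": 18.0}

depths = list(depth_by_sample.values())
avg = mean(depths)       # average depth across samples
var = pvariance(depths)  # between-sample variability

print(f"mean={avg:.1f} variance={var:.1f}")  # mean=20.0 variance=56.0
```

A contig with high between-sample variance is differentially abundant across samples, which is exactly the signal coverage-aware binners exploit.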

8 Measuring coverage

Based on the mapping, we now can summarise the contig coverage per sample to provide important information that we can use for filtering.

We will be using MetaBAT2 (Kang et al. 2019) for this.

The simplest way to use MetaBAT2 is to use a conda environment, assuming you have conda on your system.

8.1 Installing conda

Miniconda is used by many bioinformatics tools and is useful to have.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Enter ‘yes’ to both questions.

For the changes to register, log out of the VM (or close the shell) and re-connect, or run ‘source ~/.bashrc’ in your current shell.

8.2 Installing MetaBAT2 with miniconda

Set up the necessary channels for conda. The order of these lines matters.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create a new environment.

conda create -n metabat2_env metabat2

Activate the environment.

conda activate metabat2_env

Test the installation.

metabat2 --version
jgi_summarize_bam_contig_depths --help

8.3 Manual install of MetaBAT2

Download the MetaBAT2 binaries.

wget https://bitbucket.org/berkeleylab/metabat/get/v2.15.tar.gz

Extract the archive.

tar -xvzf v2.15.tar.gz

Now, to make the MetaBAT2 binary accessible from any directory, you need to know the full path to the /bin directory from the MetaBAT2 download and add it to your PATH.

echo 'export PATH="/full/path/to/metabat2/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Now it should be executable from any directory.

8.4 Assessing coverage

The ‘jgi_summarize_bam_contig_depths’ command will extract the read depth information from all .sorted.bam files. If you took a single-assembly approach, then this will have to be repeated for each sample’s .sorted.bam file against its corresponding assembly.

jgi_summarize_bam_contig_depths --outputDepth depth.txt *.sorted.bam

The console output of this command reports, for each sample, the percentage of reads that mapped successfully to the coassembly (or single assembly), which is worth noting.

We can also investigate the top lines of the depth.txt file to confirm its structure is correct.

head depth.txt

The columns should be contigName, contigLen, totalAvgDepth, and then for each sample (assuming coassembly) there will be a column for that sample’s average coverage of each contig followed by a column for that sample’s variance.
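To make that layout concrete, here is a small sketch that parses a depth table of this shape without pandas; the rows below are invented, but the column structure follows the depth.txt format described above.

```python
# A miniature depth table in the jgi_summarize_bam_contig_depths layout
# (values invented for illustration)
depth_txt = """contigName\tcontigLen\ttotalAvgDepth\ts1.sorted.bam\ts1.sorted.bam-var\ts2.sorted.bam\ts2.sorted.bam-var
contig_1\t2500\t8.4\t10.1\t3.2\t6.7\t2.1
contig_2\t800\t1.2\t0.9\t0.4\t1.5\t0.6"""

lines = depth_txt.splitlines()
header = lines[0].split("\t")
rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]

# e.g. pull out the fields used for filtering
for row in rows:
    print(row["contigName"], int(row["contigLen"]), float(row["totalAvgDepth"]))
```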

We can use this file to filter based on totalAvgDepth and contigLen. If you want to assign taxonomy to the contigs, then filtering parameters of > 1000 bp and a minimum average coverage of 2x may be appropriate. If you want to bin contigs into metagenomes, then stricter parameters are suggested.

9 Filtering contigs

Firstly, we can use a python script to filter based on contig length and total average coverage. Save the following code to a .py file called something like ‘filter_depths.py’.

The contigLen and totalAvgDepth parameters can be easily changed depending on your needs.

import pandas as pd
import os

# Create output directory
outdir = "filtered"
os.makedirs(outdir, exist_ok=True)

# Load depth file
df = pd.read_csv("depth.txt", sep="\t")

# Filter for taxonomic assignment
tax_filter = df[(df["contigLen"] >= 1000) & (df["totalAvgDepth"] >= 2)]
tax_file = os.path.join(outdir, "filtered_contigs.txt")
tax_filter.to_csv(tax_file, sep="\t", index=False)

# Summary
print("Total contigs in depth.txt:", len(df))
print("Contigs retained for taxonomic assignment:", len(tax_filter))

Execute this with python3.

python3 filter_depths.py

This will also print the number of contigs before filtering and the number retained after filtering.

But this is just a filtered text file, and we want to make a fasta file of these filtered contigs, which we can do using this text file and the .fa file of the final contigs from the assembly (e.g. a MEGAHIT coassembly).

We can do this using seqtk, so install it:

conda install -c bioconda seqtk

Firstly, let’s cut the contig IDs from the filtered depth file (which the python script wrote to the ‘filtered’ directory).

tail -n +2 filtered/filtered_contigs.txt | cut -f1 > filtered_ids.txt

We can inspect the top of this text file to make sure it is just a list of contig IDs.

head filtered_ids.txt

Then run the command to extract the fasta sequences to a new .fa file based on contig names from the filtered depth .txt file.

seqtk subseq contigs.fa filtered_ids.txt > filtered_contigs.fa

Now you have filtered the contigs and you are ready to proceed with the next steps of your analysis.

10 References

Kang, Dongwan D., Feng Li, Edward Kirton, Ashleigh Thomas, Rob Egan, Hong An, and Zhong Wang. 2019. “MetaBAT 2: An Adaptive Binning Algorithm for Robust and Efficient Genome Reconstruction from Metagenome Assemblies.” PeerJ 7: e7359. doi:10.7717/peerj.7359.
Langmead, Ben, and Steven L. Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9 (4). Nature Publishing Group: 357–59. doi:10.1038/nmeth.1923.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. “The Sequence Alignment/Map Format and SAMtools.” Bioinformatics 25 (16): 2078–79. doi:10.1093/bioinformatics/btp352.