This guide covers the bioinformatics required to construct a coassembly of contigs from multiple read files. Depending on your experimental setup, you may have multiple sets of shotgun metagenomic reads from replicates of the same treatments. There are two options for assembling the reads into contigs: single assembly, where reads from each replicate are assembled into their own set of contigs, or coassembly, where reads are pooled across all replicates and assembled into a single set of contigs. For taxonomic identification, or for binning contigs into metagenome-assembled genomes (MAGs), coassembly is advisable as it will likely produce longer contigs with higher coverage.

This guide assumes a basic knowledge of Linux and that you have access to high-performance computing resources, such as an institutional high-performance computing service or a virtual machine (VM). Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html

1 Merging read files

The first step is to merge the read files. Navigate to the directory containing your reads; for each replicate sample you should have two read files, e.g. ‘sample1_R1.fastq.gz’ and ‘sample1_R2.fastq.gz’. These are the forward (R1) and reverse (R2) reads from the shotgun sequencing of your sample.
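Before merging, it is worth confirming that every forward file has a matching reverse file. Here is a toy sketch of such a check; the sample names and the ‘demo_pair’ directory are made up for illustration:

```shell
# Toy demonstration: create dummy paired files, then check each *_R1 has a mate
mkdir -p demo_pair && cd demo_pair
touch sample1_R1.fastq.gz sample1_R2.fastq.gz sample2_R1.fastq.gz
for r1 in *_R1.fastq.gz; do
    r2=${r1%_R1.fastq.gz}_R2.fastq.gz          # swap _R1 suffix for _R2
    [ -f "$r2" ] || echo "missing mate for $r1"
done
cd ..
```

Run in your real reads directory (without the ‘touch’ fixtures), this prints nothing if all pairs are complete.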

Make a directory to store your merged reads.

mkdir coassembly_reads

Merge all the forward reads across multiple samples into a single fastq.gz file.

cat *_R1.fastq.gz > coassembly_reads/merged_R1.fastq.gz

And then repeat for the reverse reads.

cat *_R2.fastq.gz > coassembly_reads/merged_R2.fastq.gz
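Plain ‘cat’ works here because the gzip format allows multiple compressed members to be concatenated into one valid file. A toy demonstration with made-up sample names, including a quick read count as a sanity check:

```shell
# Toy demonstration: two tiny gzipped FASTQ files merged with cat
mkdir -p demo_reads demo_merged
printf '@readA\nACGT\n+\nIIII\n' | gzip > demo_reads/sampleA_R1.fastq.gz
printf '@readB\nTTGC\n+\nIIII\n' | gzip > demo_reads/sampleB_R1.fastq.gz
cat demo_reads/*_R1.fastq.gz > demo_merged/merged_R1.fastq.gz
# A FASTQ record is 4 lines, so line count / 4 = number of reads
lines=$(zcat demo_merged/merged_R1.fastq.gz | wc -l)
echo "merged reads: $((lines / 4))"   # prints: merged reads: 2
```

Counting reads in your real merged R1 and R2 files the same way is a useful check: the two counts must be identical for a paired-end assembly.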

If you instead want to do single assemblies for each sample, skip the merging and perform the following sections on each set of reads separately.

2 Performing the coassembly

There are various software options for coassembling reads into contigs, each with strengths and weaknesses. I have chosen MEGAHIT (Li et al. 2015), which is comparatively fast and memory-efficient.

2.1 Installing MEGAHIT

Conda is probably the easiest way to install with:

conda install -c bioconda megahit

But if you don’t have bioconda setup then you can install from the binary files as follows:

wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz

Then extract:

tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz

Now for a slightly complicated part: to make the ‘megahit’ command available from any location on your computer, you need to add it to your PATH. Open your bash profile:

nano ~/.bashrc

Scroll to the bottom using the arrow keys and add a new line exporting the path to your newly installed MEGAHIT ‘bin’ directory. You will have to change the path depending on where you extracted it.

export PATH="$PATH:/home/MEGAHIT-1.2.9-Linux-x86_64-static/bin"

Then exit nano with Ctrl + X and hit Y to save the changes.

Re-load the bash profile:

source ~/.bashrc

Now check that you can bring up the help documentation from any directory:

megahit --help

If this works, MEGAHIT is correctly installed and executable from any location.
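If the PATH mechanism is unfamiliar, here is a toy sketch showing why the .bashrc line works: appending a directory to PATH makes every executable inside it callable by name alone. The stub script and ‘demo_bin’ directory are made up for illustration, standing in for the real MEGAHIT binaries:

```shell
# Toy illustration of PATH lookup using a hypothetical stub script
mkdir -p demo_bin
printf '#!/bin/sh\necho "stub v1.2.9"\n' > demo_bin/megahit_stub
chmod +x demo_bin/megahit_stub
export PATH="$PATH:$PWD/demo_bin"   # same pattern as the .bashrc line above
megahit_stub                        # now found without typing the full path
```

The difference in the real setup is only that the export line lives in ~/.bashrc, so it is applied automatically in every new shell.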

2.2 MEGAHIT coassembly

Now the coassembly can be performed. The ‘--presets meta-large’ argument is intended for when the biodiversity of the system is expected to be high; if your study system is simpler, it can be removed. The ‘--min-contig-len 1000’ argument discards assembled contigs shorter than 1000 bp. The ‘-t’ argument sets the number of threads assigned to the job, which will depend on your computing system.

megahit -1 coassembly_reads/merged_R1.fastq.gz \
        -2 coassembly_reads/merged_R2.fastq.gz \
        -o megahit_coassembly \
        --min-contig-len 1000 \
        --presets meta-large \
        -t 16

This step can take a long time, so it is advisable to run it in the background using ‘nohup’.

nohup megahit -1 coassembly_reads/merged_R1.fastq.gz \
        -2 coassembly_reads/merged_R2.fastq.gz \
        -o megahit_coassembly \
        --min-contig-len 1000 \
        --presets meta-large \
        -t 16 > megahit.log 2>&1 &

Then you can log out of the VM or disconnect and the process will still run. At any time you can re-connect and check that the process is still running by searching the active processes for the term ‘megahit’:

ps aux | grep megahit

Or by checking the top CPU-consuming jobs:

top

Use ‘q’ to exit viewing the top jobs.

You can also check the log file with:

tail -f megahit.log

This will update as the process progresses, and the output is stored in this .log file.
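The nohup-plus-log pattern can be tried safely on a short dummy job before committing to a long assembly. In this sketch, ‘sleep’ stands in for the assembler and ‘demo.log’ is a made-up log name:

```shell
# Dummy background job: 'sleep' stands in for the long-running assembler
nohup sh -c 'echo started; sleep 1; echo finished' > demo.log 2>&1 &
wait $!          # in real use you would log out instead of waiting
cat demo.log     # both progress lines have been captured in the log
```

Because both stdout and stderr are redirected into the log, nothing is lost when you disconnect from the VM.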

Once complete, I recommend checking the second-to-last line of the .log file, which summarises the coassembly with the following metrics:

  • Number of contigs - Higher numbers suggest a more diverse system.
  • Total bp - The total length of all contigs combined. A single bacterial genome is roughly 3 - 6 Mbp, so a total far above this indicates many genomes in the assembly, which is typical at this stage.
  • Min bp - Length of the shortest contig, which will be influenced by the minimum contig length you set in the MEGAHIT command.
  • Max bp - Length of the longest contig; values > 1 Mbp may represent near-complete chromosomes.
  • Avg bp - The average contig length. There are usually many more short contigs than long ones, so this value is skewed downwards.
  • N50 bp - The length of the shortest contig within the set of longest contigs that together make up 50 % of the total assembly length. Metagenome coassembly N50 values are often in the kbp range, roughly 5 - 20 kbp, though this varies widely with community complexity.
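To make the N50 definition concrete, here is a small sketch that computes it from a list of contig lengths. The ‘n50’ function name and the example lengths are made up for illustration:

```shell
# Hypothetical helper: N50 = length of the contig at which the cumulative
# sum of descending-sorted lengths first reaches half the total assembly size
n50() {
    printf '%s\n' "$@" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= half) { print len[i]; exit }
            }
        }'
}
n50 100 80 60 40 20   # total = 300, half = 150; 100 + 80 = 180 >= 150, so N50 = 80
```

In practice you would extract the real contig lengths from the assembled contigs FASTA in the MEGAHIT output directory before piping them in.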

3 References

Li, Dinghua, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. 2015. “MEGAHIT: An Ultra-Fast Single-Node Solution for Large and Complex Metagenomics Assembly via Succinct de Bruijn Graph.” Bioinformatics 31 (10): 1674–76. doi:10.1093/bioinformatics/btv033.