This guide aims to cover the bioinformatics required to construct a coassembly of contigs from multiple read files. Depending on your experimental setup, you may have multiple sets of shotgun metagenomic reads from replicates of the same treatments. There are two options for assembling the reads into contigs: single assembly, where the reads from each replicate are assembled into their own set of contigs, or coassembly, where reads are pooled across all replicates and assembled into a single set of contigs. For taxonomic identification, or for binning contigs into metagenome-assembled genomes (MAGs), coassembly is advisable as it will likely result in longer contigs with higher coverage.
This guide assumes a basic knowledge of Linux and an understanding that high-performance computing resources, such as an institutional HPC service or a cloud virtual machine (VM), are required. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html
The first step is to merge the read files. Navigate to the directory containing your reads; for each replicate sample you should have two read files, e.g. ‘sample1_R1.fastq.gz’ and ‘sample1_R2.fastq.gz’. These are the forward (R1) and reverse (R2) reads from the shotgun sequencing of your sample.
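A quick optional check before merging: listing the files confirms that every sample has a matched R1/R2 pair (the counts of forward and reverse files should be equal):

```shell
# List forward and reverse read files separately; the two lists
# should be the same length and share the same sample prefixes
ls -1 *_R1.fastq.gz
ls -1 *_R2.fastq.gz
```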
Make a directory to store your merged reads.
mkdir coassembly_reads
Merge all the forward reads across multiple samples into a single fastq.gz file.
cat *_R1.fastq.gz > coassembly_reads/merged_R1.fastq.gz
And then repeat for the reverse reads.
cat *_R2.fastq.gz > coassembly_reads/merged_R2.fastq.gz
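As an optional sanity check, the merged file should contain exactly as many lines as the input files combined, since FASTQ stores four lines per read and concatenated gzip streams decompress as one:

```shell
# Total lines across all input forward-read files
zcat *_R1.fastq.gz | wc -l
# This should print the same number for the merged file
zcat coassembly_reads/merged_R1.fastq.gz | wc -l
```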
If you instead want to perform single assemblies for each sample, skip the merging and run the following sections on each set of reads separately.
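For the single-assembly route, a loop over the sample names could look like the following. This is only a sketch, assuming the ‘sampleN_R1/_R2.fastq.gz’ naming shown above and that MEGAHIT is installed as described in the next section:

```shell
# Sketch: run a separate MEGAHIT assembly per sample
for r1 in *_R1.fastq.gz; do
    sample=${r1%_R1.fastq.gz}          # strip the suffix to get the sample name
    megahit -1 "${sample}_R1.fastq.gz" \
            -2 "${sample}_R2.fastq.gz" \
            -o "megahit_${sample}" \
            --min-contig-len 1000 \
            -t 16
done
```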
There are various software options available for coassembling reads into contigs, each with strengths and weaknesses. I have chosen MEGAHIT (Li et al. 2015), which is comparatively fast.
Conda is probably the easiest way to install it:
conda install -c bioconda megahit
But if you don’t have bioconda set up, then you can install from the precompiled binaries as follows:
wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
Then extract:
tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
Now for a slightly more complicated part: to make the ‘megahit’ command available from any location on your computer, you need to add it to your PATH. Open your bash profile:
nano ~/.bashrc
Scroll to the bottom using the arrow keys and add a new line with the path to your newly installed MEGAHIT. You will need to adjust the path depending on where you extracted it.
export PATH="$PATH:/home/MEGAHIT-1.2.9-Linux-x86_64-static/bin"
Then exit nano with Ctrl + X, press Y to save the changes, and press Enter to confirm the filename. Finally, re-load the bash profile:
source ~/.bashrc
If you can now bring up the help documentation from any directory:
megahit --help
then MEGAHIT is installed correctly and executable from any location.
Now the coassembly can be performed. The added argument ‘--presets meta-large’ is used when the biodiversity is expected to be high; if your study system is simpler, this can be removed. The ‘-t’ argument sets how many cores are assigned to the job, which will depend on your computing system.
megahit -1 coassembly_reads/merged_R1.fastq.gz \
-2 coassembly_reads/merged_R2.fastq.gz \
-o megahit_coassembly \
--min-contig-len 1000 \
--presets meta-large \
-t 16
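If you are unsure how many cores your machine has (to choose a sensible ‘-t’ value), ‘nproc’ will tell you:

```shell
# Report the number of processing units available to the current process
nproc
```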
This step can take a long time, so it is advisable to run it in the background using ‘nohup’.
nohup megahit -1 coassembly_reads/merged_R1.fastq.gz \
-2 coassembly_reads/merged_R2.fastq.gz \
-o megahit_coassembly \
--min-contig-len 1000 \
--presets meta-large \
-t 16 > megahit.log 2>&1 &
Then you can log out of the VM or disconnect, and the process will continue running. At any time you can reconnect and check that the process is still running by searching the active processes for the term ‘megahit’:
ps aux | grep megahit
Or by checking the top CPU-consuming jobs:
top
Press ‘q’ to exit the top view.
You can also check the log file with:
tail -f megahit.log
This view updates as the process progresses; all of the MEGAHIT output is stored in this .log file. Press Ctrl + C to stop following the log.
Once complete, I recommend inspecting the second-to-last line of the .log file, which stores the MEGAHIT summary and reports the following metrics about the coassembly: