1 What is quality control?

Commercial metabarcoding services may process sequencing data through their own dedicated pipelines to generate OTU tables and taxonomic classifications that can be used for comparisons. However, if you want more control over the process, I would recommend processing the raw sequencing results yourself.

Depending on the primer pair you have used, and the target region for metabarcoding (e.g. ITS1, ITS2, 18SV4, etc.), there are different options for bioinformatic pipelines available to process the raw sequencing data into meaningful OTU data with taxonomic classifications.

Whatever the pipeline, it is important to understand the quality of the sequencing results before processing, so that if anything goes wrong the reason can be identified. Commercial metabarcoding services generally guarantee a minimum quality. In my experience, the returned data never look perfect but have always been fine for processing and generating OTU data, as many of these pipelines deal with minor quality issues themselves.

Nevertheless, it is worthwhile to check the quality of the raw sequencing results to confirm that nothing has clearly gone wrong. FastQC is a useful bioinformatic tool that can be run on each raw sequencing file, and its results can be compiled into an easily interpretable report using MultiQC.

2 Preparing a quality control conda environment

This guide assumes you have access to a high-performance computing (HPC) system or another Linux environment, or that you have set up a VM following my instructions (https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html).

Miniconda provides the conda package manager, which is used to create isolated environments for many bioinformatic tools. It can be installed as follows:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Enter ‘yes’ to both questions.

For the changes to take effect, log out of the Linux system or VM and then log back in or reconnect.

Next, set up an environment for quality control of the raw sequencing read files (.fastq.gz or .fastq), using FastQC to run the checks on each file (Andrews 2010) and MultiQC to assemble a single report for all files (Ewels et al. 2016).

Create the conda environment based on Python 3.8, and call it ‘qc_env’.

conda create -n qc_env python=3.8

Activate the environment.

conda activate qc_env

Install FastQC and MultiQC.

conda install -c bioconda fastqc
conda install -c bioconda -c conda-forge multiqc
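
Before moving on, it can be worth confirming that both tools are visible on the PATH of the active environment. The snippet below is a minimal sketch of such a check; check_tools is a hypothetical helper name introduced here for illustration and is not part of conda, FastQC or MultiQC.

```shell
# check_tools is a hypothetical helper (not part of conda, FastQC or MultiQC).
# It reports whether each named command can be found on the current PATH
# and returns non-zero if any of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Run with qc_env activated; a "missing" line suggests the install did not complete.
check_tools fastqc multiqc || echo "Re-activate qc_env or repeat the install step."
```

If both tools are reported as found, running fastqc --version and multiqc --version should also print their version numbers.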

3 Transferring files to a VM

If you are using a cloud computing service like a VM, then you will need to log out of the VM and transfer the raw sequencing read files in fastq.gz format to the VM using Windows PowerShell or the macOS/Linux Terminal.

3.1 Compression

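First, bundle all raw sequencing read files into a single tar.gz archive. Note that paths use backslashes on the Windows command line but forward slashes on the Linux command line of a VM. The -C option changes into the directory containing the files, and the trailing dot archives everything in that directory, so the directory should contain only the raw sequencing files.

tar -czvf rawdata.tar.gz -C "C:\path\to\directory\containing\raw\sequencing\files" .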

3.2 Transfer

Transfer the compressed rawdata file to the VM. You will be prompted for the VM password.

scp "C:\path\to\directory\containing\compressed\file\rawdata.tar.gz" root@ipforyourvm:~

Then connect to the VM as before, and make a directory to perform the analysis in, with a rawdata directory below it.

mkdir process
mkdir process/rawdata

Move the .tar.gz file into the new directory.

mv rawdata.tar.gz process

Move into the process/ directory and extract the compressed rawdata file into the rawdata directory.

cd process
tar -xvzf rawdata.tar.gz -C rawdata/
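
A transfer can occasionally truncate or corrupt a file, so it may be worth testing the extracted files before running anything on them. The sketch below uses gzip -t, which tests the integrity of a gzip file without writing any output. The helper name check_gz is hypothetical, and the sketch assumes the files end in .fastq.gz and sit directly in the rawdata directory.

```shell
# check_gz is a hypothetical helper: it runs `gzip -t` (integrity test) on
# every .fastq.gz file in a directory and returns non-zero if any file fails.
check_gz() {
  dir=$1
  bad=0
  for f in "$dir"/*.fastq.gz; do
    # If the glob matched nothing, the unexpanded pattern is left in $f.
    [ -e "$f" ] || { echo "no .fastq.gz files in $dir"; return 1; }
    if gzip -t "$f" 2>/dev/null; then
      echo "ok: $f"
    else
      echo "corrupt: $f"
      bad=1
    fi
  done
  return "$bad"
}

# Run from inside the process/ directory.
check_gz rawdata || echo "Check the transfer before running FastQC."
```

Any file reported as corrupt is best transferred again rather than passed on to FastQC.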

4 Quality control

4.1 FastQC

First, activate the quality control environment prepared earlier and make a directory for the FastQC output.

conda activate qc_env
mkdir qc_out

The ‘ls’ command lists the raw read files and passes them to xargs, which runs FastQC on one file per process (-n 1) with up to 4 processes at once (-P 4):

ls rawdata/*fastq.gz | xargs -n 1 -P 4 fastqc -o qc_out
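
FastQC writes one HTML report and one zip archive per input file, named after the input with a _fastqc suffix. When files are processed in parallel, a single failure is easy to miss, so the sketch below lists any input that has no matching report. The helper name missing_reports is hypothetical, and the sketch assumes input names ending in .fastq.gz.

```shell
# missing_reports is a hypothetical helper. For each .fastq.gz file in the
# input directory it prints the file name if no matching
# <name>_fastqc.html report exists in the output directory.
missing_reports() {
  in_dir=$1
  out_dir=$2
  for f in "$in_dir"/*.fastq.gz; do
    [ -e "$f" ] || continue                 # glob matched nothing
    base=$(basename "$f" .fastq.gz)         # sample_R1.fastq.gz -> sample_R1
    [ -e "$out_dir/${base}_fastqc.html" ] || echo "$f"
  done
}

# No output means every input file has a report.
missing_reports rawdata qc_out
```

FastQC also has its own -t (threads) option, so fastqc -t 4 -o qc_out rawdata/*fastq.gz should be a roughly equivalent alternative to the xargs approach.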

4.2 MultiQC

When FastQC has finished, assemble all of its outputs into a single report using MultiQC. First make a directory for the MultiQC output, then run MultiQC.

mkdir multiqc_out
multiqc qc_out -o multiqc_out

Log out of the VM and transfer the report back to your local machine for inspection.

logout
scp root@ipforyourvm:~/process/multiqc_out/multiqc_report.html "C:\path\to\your\desired\directory\location"

5 Interpreting FastQC results in the MultiQC output

The following quality control measures are visible in the assembled MultiQC report and here is my interpretation in the context of amplicon metabarcoding of environmental DNA.

  • Sequence counts: The number of reads obtained for each sample, split into unique and duplicate reads. In metabarcoding of environmental samples, duplication levels are expected to be high.
  • Sequence quality histograms: Shows the mean quality score at each base position across all reads in each sample. Quality is reported as a Phred score Q, where the probability of an incorrect base call is 10^(-Q/10), so Q30 corresponds to 1 error in 1,000 bases. This is probably the most important check, as poor sequence quality indicates that something went wrong during sample preparation or sequencing and cannot be fixed at this stage. I have never had a sample fail from a commercial provider.
  • Per sequence quality scores: Shows how many reads have each mean quality score. Together with the histograms above, this is the most important thing to check.
  • Per base sequence content: Shows the proportion of each base at each position, with samples as rows. In amplicon metabarcoding an even distribution is not expected, as amplicons are similar in some regions and variable in others. Environmental samples also tend to be dominated by certain taxa rather than evenly distributed, so sequencing results would be expected to fail this check.
  • Per sequence GC content: Due to the nature of environmental metabarcoding, an even distribution is not expected here either, so samples are likely to fail. However, it is useful to check that you don’t have very high or very low GC contents, which might signify a sequencing error.
  • Per base N content: An N is called when the actual base cannot be confidently determined. This should be very low for all samples if sequencing has worked correctly.
  • Sequence length distribution: As amplicons of a specific length are targeted and sequencers are usually set up to read a fixed number of base pairs, an even distribution is not expected and a warning may be raised. The key check is that almost all reads fall within the expected length range for the targeted amplicon.
  • Sequence duplication levels: Will likely fail, as most samples have very high duplication levels. The sequencing depth captures many reads with the same sequence, and a few taxa often dominate microbial communities.
  • Overrepresented sequences: For the same reasons, high levels of overrepresentation are expected.
  • Adapter content: Adapters are used in the sequencing reaction but are often removed automatically by the sequencing platform’s software, so this is likely to depend on your sequencing provider. Many pipelines use Cutadapt (Martin 2011) to remove any remaining adapter sequences, or Cutadapt can be run separately before processing.
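
The sequence counts and sequence length distribution can also be cross-checked directly from a raw file, which helps when a value in the report looks surprising. The following is a minimal sketch: read_lengths is a hypothetical helper name, and it assumes gzip-compressed FASTQ files with exactly four lines per read (no wrapped sequence lines).

```shell
# read_lengths is a hypothetical helper. It assumes a gzip-compressed FASTQ
# file with exactly four lines per read, so every sequence is on a line
# where (line number mod 4) equals 2.
# Output: one "length count" pair per line, sorted by length.
read_lengths() {
  gzip -dc "$1" |
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }' |
    sort -n
}

# Print the length distribution for every raw file (run from process/).
for f in rawdata/*.fastq.gz; do
  [ -e "$f" ] || continue
  echo "== $f"
  read_lengths "$f"
done
```

Summing the second column for a file gives its total read count, which should agree with the sequence count shown for that file in the MultiQC report.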

The most important thing is to check the sequence quality scores and sequence lengths to make sure nothing has gone obviously wrong with the sequencing. All the other issues such as duplication and overrepresentation are dealt with during the processing stage.
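
With many samples, these key checks can also be screened from the command line. Each FastQC zip archive contains a summary.txt file with one tab-separated line per module giving the status (PASS, WARN or FAIL), the module name and the file name. The sketch below assumes the archives in qc_out have already been unpacked (for example with unzip, or by running FastQC with its --extract option), and reports FAIL only for the quality-related modules, ignoring the failures that are expected for amplicon data. The helper name flag_key_failures is hypothetical.

```shell
# flag_key_failures is a hypothetical helper. It assumes unpacked FastQC
# output, i.e. <out_dir>/<name>_fastqc/summary.txt, where each line is
# STATUS<TAB>module name<TAB>file name.
# It prints "file: module" for FAILs in the quality-related modules only.
flag_key_failures() {
  out_dir=$1
  for s in "$out_dir"/*_fastqc/summary.txt; do
    [ -e "$s" ] || continue                 # glob matched nothing
    awk -F '\t' '
      $1 == "FAIL" && ($2 == "Per base sequence quality" ||
                       $2 == "Per sequence quality scores" ||
                       $2 == "Per base N content") {
        print $3 ": " $2
      }' "$s"
  done
}

# No output means none of the quality-related modules failed.
flag_key_failures qc_out
```

This is only a screen for obvious problems. The plots in the MultiQC report remain the better place to judge borderline samples.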

Once checked, you are now ready to proceed to processing amplicon sequence files through a dedicated pipeline!

References

Andrews, Simon. 2010. “FastQC: A Quality Control Tool for High Throughput Sequence Data.” https://www.scienceopen.com/document?vid=de674375-ab83-4595-afa9-4c8aa9e4e736.
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. “MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report.” Bioinformatics 32 (19): 3047–48. doi:10.1093/bioinformatics/btw354.
Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet.journal 17 (1): 10–12. doi:10.14806/ej.17.1.200.