Commercial metabarcoding services may process sequencing data through their own dedicated pipelines to generate OTU tables and taxonomic classifications that can be used for comparisons. However, if you want more control over the process, I would recommend processing the raw sequencing results yourself.
Depending on the primer pair you have used, and the target region for metabarcoding (e.g. ITS1, ITS2, 18SV4, etc.), there are different options for bioinformatic pipelines available to process the raw sequencing data into meaningful OTU data with taxonomic classifications.
Whatever the pipeline, it is important to understand the quality of the sequencing results prior to processing, so that if anything goes wrong the reason can be identified. Commercial metabarcoding services will generally guarantee a minimum quality. In my experience, the data returned never look perfect but are always fine for processing and generating OTU data, because many of these pipelines deal with quality issues themselves.
Nevertheless, it is worthwhile to check the quality of the raw sequencing results to confirm that nothing clearly wrong has occurred. FastQC is a useful bioinformatic tool that can be used for each raw sequencing file and these results can be compiled into an easily interpretable report using MultiQC.
The steps below assume you have access to a high-performance computing system, a LINUX environment, or a VM set up following my instructions (https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html).
Miniconda is required to create environments for a lot of bioinformatic processes and can be installed as follows:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Enter ‘yes’ to both questions.
For the changes to register, you will have to shut down the LINUX system or log out of the VM, and then restart or reconnect.
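After reconnecting, it can be worth confirming that the shell can actually see conda before going any further. This is a minimal sketch, assuming the installer was allowed to initialise your shell; the variable name `conda_status` is only illustrative.

```shell
# Check whether conda is visible on PATH in the current shell session.
if command -v conda >/dev/null 2>&1; then
    conda_status="found"
    conda --version   # prints the installed conda version
else
    conda_status="missing"
    echo "conda not found: try opening a new shell or reconnecting" >&2
fi
echo "conda: $conda_status"
```

If conda is reported as missing after a fresh login, re-running the installer and accepting the shell initialisation step is usually the first thing to try.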
Set up an environment for quality control of the raw read sequencing files (.fastq.gz or .fastq), using FastQC to assess each file (Andrews 2010) and MultiQC to assemble a single report for all files (Ewels et al. 2016).
Create the conda environment based on Python 3.8 and call it ‘qc_env’.
conda create -n qc_env python=3.8
Activate the environment.
conda activate qc_env
Install FastQC and MultiQC.
conda install -c bioconda fastqc
conda install -c bioconda -c conda-forge multiqc
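Before moving on, a quick check that both tools are callable from the active environment can save confusion later. This is a small sketch to run with qc_env activated; both FastQC and MultiQC accept `--version`, and the `missing` counter is just an illustrative name.

```shell
# Report the version of each QC tool, or flag it if it is not on PATH.
missing=0
for tool in fastqc multiqc; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" --version
    else
        echo "$tool not found in the active environment" >&2
        missing=$((missing + 1))
    fi
done
echo "tools missing: $missing"
```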
If you are using a cloud computing service like a VM, then you will need to log out of the VM and transfer the raw read sequencing files in fastq.gz format to the VM using Windows PowerShell or the macOS/LINUX Terminal.
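First, compress all raw read sequencing files into a single tar.gz file. Notice the backslashes used on the Windows command line compared to the forward slashes on the LINUX command line of a VM. The -C option takes the directory to archive from, and the trailing dot adds everything inside it, so this assumes the directory holds only the raw read files.
tar -czvf rawdata.tar.gz -C "C:\path\to\directory\containing\raw\sequencing\files" .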
Transfer the compressed rawdata file to the VM. You will be asked for the VM password to complete the transfer.
scp "C:\path\to\directory\containing\compressed\file\rawdata.tar.gz" root@ipforyourvm:~
Then connect to the VM as before, and make a directory to perform the analysis in, with a rawdata directory below it.
mkdir process
mkdir process/rawdata
Move the .tar.gz file into the new directory.
mv rawdata.tar.gz process
Move into the process/ directory and extract the compressed rawdata file into the rawdata directory.
cd process
tar -xvzf rawdata.tar.gz -C rawdata/
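Large transfers occasionally leave a truncated file behind, so it may be worth confirming that every extracted file is an intact gzip archive before running any QC. The following is a sketch rather than a required step; the function name `check_fastq_gz` is hypothetical, and it assumes the files sit directly inside rawdata/.

```shell
# Count the .fastq.gz files in a directory and test each one for gzip corruption.
# Prints a summary line and returns non-zero if any file fails `gzip -t`.
check_fastq_gz() {
    local dir="$1" total=0 bad=0 f
    for f in "$dir"/*fastq.gz; do
        [ -e "$f" ] || continue        # skip the unexpanded pattern if nothing matches
        total=$((total + 1))
        if ! gzip -t "$f" 2>/dev/null; then
            echo "corrupt or truncated: $f" >&2
            bad=$((bad + 1))
        fi
    done
    echo "$total files checked, $bad failed"
    [ "$bad" -eq 0 ]
}

# Example: run from inside process/ after extraction.
if [ -d rawdata ]; then
    check_fastq_gz rawdata || echo "re-transfer the files listed above"
fi
```

The number of files checked should match the number of files you received from the sequencing service.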
First, activate the qc_env environment prepared earlier, then make a directory for the output.
conda activate qc_env
mkdir qc_out
You can then pipe the output of the list or ‘ls’ command into xargs to run FastQC on 4 files at once using the following command:
ls rawdata/*fastq.gz | xargs -n 1 -P 4 fastqc -o qc_out
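Once the run has finished, a quick way to confirm that no file was silently skipped is to look for a report per input. FastQC names its output after the input file with the .fastq.gz extension replaced by `_fastqc.html`. The sketch below relies on that naming; the function name `missing_fastqc_reports` is hypothetical.

```shell
# List raw read files that have no matching FastQC HTML report.
# Usage: missing_fastqc_reports <input_dir> <fastqc_output_dir>
missing_fastqc_reports() {
    local in_dir="$1" out_dir="$2" f base
    for f in "$in_dir"/*fastq.gz; do
        [ -e "$f" ] || continue
        base=$(basename "$f")
        base="${base%.gz}"       # sample.fastq.gz -> sample.fastq
        base="${base%.fastq}"    # sample.fastq    -> sample
        [ -e "$out_dir/${base}_fastqc.html" ] || echo "$f"
    done
}

# Example: run from inside process/ after FastQC has finished.
if [ -d rawdata ] && [ -d qc_out ]; then
    missing_fastqc_reports rawdata qc_out
fi
```

No output means every input file has a report. Any file names printed can be re-run through FastQC individually.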
When complete, you can assemble all FastQC outputs into a single report using MultiQC. Firstly, make a directory for the MultiQC output and then run MultiQC.
mkdir multiqc_out
multiqc qc_out -o multiqc_out
Log out of the VM, then transfer the report back to your local machine for inspection.
logout
scp root@ipforyourvm:~/process/multiqc_out/multiqc_report.html "C:\path\to\your\desired\directory\location"
The following quality control measures are visible in the assembled MultiQC report and here is my interpretation in the context of amplicon metabarcoding of environmental DNA.
The most important thing is to check the sequence quality scores and sequence lengths to make sure nothing has gone obviously wrong with the sequencing. All the other issues such as duplication and overrepresentation are dealt with during the processing stage.
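If you want to cross-check the sequence lengths shown in the report directly on the command line, the lengths can be tallied from a raw file with standard tools. This is a rough sketch, assuming standard four-line FASTQ records; the function name `read_length_counts` and the example file name are hypothetical.

```shell
# Print a read-length tally for one FASTQ file, gzipped or not.
# Assumes four-line FASTQ records, so the sequence is line 2 of every 4.
# Each output line is: <number of reads> <read length>, sorted by length.
read_length_counts() {
    local fq="$1"
    case "$fq" in
        *.gz) gzip -dc "$fq" ;;
        *)    cat "$fq" ;;
    esac | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print n[len], len }' | sort -k2,2n
}

# Example (hypothetical file name):
#   read_length_counts rawdata/sample_R1.fastq.gz
```

A file dominated by unexpectedly short reads is a sign that something may have gone wrong and is worth raising with the sequencing provider.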
Once checked, you are now ready to proceed to processing amplicon sequence files through a dedicated pipeline!