For fungal community experiments, the ITS region is a popular amplicon target for metabarcoding. The PIPITS pipeline is specifically designed to process ITS1 or ITS2 amplicon data for fungal metabarcoding experiments, and is simple to both install and run.

I would very much recommend visiting the PIPITS GitHub page: https://github.com/hsgweon/pipits and also reading the paper on PIPITS: https://doi.org/10.1111/2041-210X.12399

The GitHub page is extremely helpful and explains each step in detail. It also highlights that a powerful computing system with at least 16 GB of memory is required to run PIPITS. Here I summarise the steps for installing and running PIPITS on a high performance computing system or a VM.

It is strongly advised that you check the quality of your sequencing results prior to any amplicon sequence processing. I have a guide on how to use FastQC for assessing the quality of amplicon sequencing results here: https://scottc-bio.github.io/guides/Metabarcoding-quality-control-with-FastQC.html

1 Installing PIPITS

Log in to your high performance computing platform, or set up a cloud VM (https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html).

Create a PIPITS conda environment.

conda create -n pipits_env --channel bioconda --channel conda-forge python=3.10 pipits hmmer

The environment is now ready.

2 Transferring files to a VM

If you are using a cloud computing service like a VM, you will need to log out of the VM and transfer the raw read sequencing files (in fastq.gz format) to the VM using Windows PowerShell or the macOS/Linux Terminal.

2.1 Compression

First, compress all raw read sequencing files into a tar.gz file. Notice the backslashes used on the Windows command line compared to the forward slashes on the Linux command line of a VM.

tar -czvf rawdata.tar.gz -C "C:\path\to\directory\containing\raw\sequencing\files" *fastq.gz

2.2 Transfer

Transfer the compressed rawdata file to the VM; you will need to enter the VM password for the transfer.

scp "C:\path\to\directory\containing\compressed\file\rawdata.tar.gz" root@ipforyourvm:~
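Optionally, you can verify that the transfer was not corrupted by comparing checksums on both machines; a minimal sketch (on Windows, the PowerShell equivalent of `sha256sum` is `Get-FileHash`):

```shell
# On the local machine (macOS/Linux; on Windows use: Get-FileHash rawdata.tar.gz)
sha256sum rawdata.tar.gz

# On the VM after the transfer - the two hashes should match exactly
sha256sum ~/rawdata.tar.gz
```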

Then connect to the VM as before, and make a directory to perform the analysis in, with a rawdata directory below it.

mkdir process
mkdir process/rawdata

Move the .tar.gz file into the new directory.

mv rawdata.tar.gz process

Move into the process/ directory and extract the compressed rawdata file into the rawdata directory.

cd process
tar -xvzf rawdata.tar.gz -C rawdata/
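As a quick sanity check, you can confirm that the expected number of read files was extracted; for paired-end data this should be twice the number of samples (one R1 and one R2 file per sample):

```shell
# Count the extracted raw read files (R1 + R2 per sample for paired-end data)
ls rawdata/*.fastq.gz | wc -l
```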

3 Running PIPITS

Activate the pipits environment prepared earlier.

conda activate pipits_env

Create a list of read pairs from the paired raw read files.

pispino_createreadpairslist -i rawdata/ -o readpairslist.txt

Prep the sequences for processing.

pispino_seqprep -i rawdata/ -o out_seqprep -l readpairslist.txt

The next step is to extract the ITS regions from the read data; this is the most computationally intensive step of the PIPITS pipeline, and using multiple CPUs will massively speed it up. The nohup command will also be used to run the process in the background of the VM without relying on a connection from a local machine, i.e. you can log out of the VM and the process will continue to run. For context, 480 raw read files took seven days to run on a single CPU, but less than 48 hours with 6 CPUs.

Extract the ITS regions using 6 CPUs and run the process in the background.

nohup pipits_funits -i out_seqprep/prepped.fasta -o out_funits -x ITS2 -t 6 > pipits_funits.log 2>&1 &

You can check that the process is running by searching active processes for the name pipits_funits. You can also use the top command, where you should see vsearch or hmmer processes using the highest CPU allocations.

ps aux | grep pipits_funits
top

When you are confident that the process is running, you can log out of the VM and close PowerShell.
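When you log back in later, you can check on progress from the log file written by nohup, and confirm the job is still alive; a minimal sketch:

```shell
# View the latest progress messages from the background job
tail -n 20 pipits_funits.log

# The bracket trick stops grep from matching its own process
ps aux | grep [p]ipits_funits
```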

Once complete, the final step of the pipeline is to process the ITS sequences into the final outputs. Again, this can be run in the background because it will take a few hours.

nohup pipits_process -i out_funits/ITS.fasta -o out_process > pipits_process.log 2>&1 &

When complete, you can use an additional command included in PIPITS that prepares an OTU table for FUNGuild analysis, which will assign functional classifications to OTUs.

pipits_funguild.py -i out_process/otu_table_sintax.txt -o out_process/otu_table_funguild.txt

The PIPITS pipeline for ITS amplicon raw read sequencing files is now complete. The directories and output files of interest are the following:

  1. The log files from the two pipits_funits and pipits_process commands run in the background:
     1. pipits_funits.log
     2. pipits_process.log
  2. pipits_db/ - Contains the version of the UNITE database downloaded at the time of running (the most up to date at that point), good to reference.
  3. out_seqprep/ - Contains the prepped sequences ready for ITS extraction; the summary file contains info on the numbers of reads before and after filtering.
  4. out_funits/ - Contains the ITS.fasta file with the extracted ITS sequences. The summary file reports the number of sequences left after dereplication and how many remain after ITS extraction.
  5. out_process/ - The final outputs in both .biom and .txt format:
     1. summary.log - Contains info on the number of OTUs, number of samples, and number of reads used to generate the OTU table.
     2. assigned_taxonomy_sintax.txt - Taxonomic assignments for all OTUs.
     3. phylotype_table_sintax.txt - Absolute read abundances for each taxonomic group, binned across OTUs.
     4. otu_table_sintax.txt - Absolute read abundances for each OTU across samples, with the taxonomic classification of each OTU at the end.
     5. otu_table_funguild.txt - Absolute read abundance OTU table prepped for FUNGuild analysis.
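To get a quick feel for the final outputs, you can inspect the OTU table on the command line; this sketch assumes the tab-separated layout described above (OTU rows, sample columns, one header line):

```shell
# Peek at the first few rows of the final OTU table
head -n 5 out_process/otu_table_sintax.txt

# Count the OTUs (total lines minus the header line)
echo $(( $(wc -l < out_process/otu_table_sintax.txt) - 1 ))
```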