Assembling reads into contigs may be enough for your analytical needs, e.g. if you just want to predict protein-coding regions or assign taxonomies to the contigs to get a broad taxonomic profile of the system.
In this case, contigs can be annotated in different ways; I will demonstrate here the use of the Contig Annotation Tool (CAT) for conservative taxonomic classification, and Kraken2 for a more liberal taxonomic classification.
I would recommend filtering the contigs from your assembly based on length and coverage, which requires mapping reads from your samples to the assembly. Mapping and filtering are covered in my guide here: https://scottc-bio.github.io/guides/Shotgun-metagenomic-read-mapping.html
CAT (von Meijenfeldt et al. 2019) is useful as it works by aggregating the taxonomic predictions of multiple ORFs on a contig to build support for a contig-level taxonomic assignment. Protein-coding regions are therefore predicted as part of the CAT pipeline, and these can be reused for future functional analysis.
We will create a conda environment suitable for both CAT and Prodigal.
conda create -n cat_env -c bioconda -c conda-forge python=3.10 diamond prodigal
Activate the environment.
conda activate cat_env
Prodigal will be used to predict the coding regions on the contigs which CAT will use for taxonomic classification (Hyatt et al. 2010).
Then navigate to the directory containing your filtered contig sequences and run the gene prediction. The ‘-i’ argument is the input .fa file of contig sequences; ‘-a’ is the output amino acid fasta; ‘-d’ is the output nucleotide fasta; ‘-o’ is the output gff file containing the gene locations; ‘-f’ specifies the output format as gff, which is useful in many downstream analyses; and ‘-p meta’ enables metagenomic mode, which is essential for contigs as it allows Prodigal to handle the fragmented nature of the sequences more effectively.
prodigal -i filtered_contigs.fa -a contig_proteins.faa -d contig_genes.fna -o contig_genes.gff -f gff -p meta
This might take some time, so you can run it in the background using nohup.
nohup prodigal -i filtered_contigs.fa -a contig_proteins.faa -d contig_genes.fna -o contig_genes.gff -f gff -p meta > prodigal.log 2>&1 &
This will run in the background and write progress to prodigal.log. Prodigal works through the contigs one by one looking for genes, so when the log reaches the last contig in your filtered assembly, the run is done.
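One rough way to gauge progress is to compare the total number of contigs in the input against what the log has reached so far. The header-counting trick below is demonstrated on a tiny stand-in fasta; for a real run, point it at your filtered_contigs.fa (the log format varies between Prodigal versions, so inspect it with tail rather than parsing it).

```shell
# Demo on a tiny stand-in fasta; swap in filtered_contigs.fa for real data
printf '>contig_1\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n' > demo_contigs.fa

# Count fasta header lines to get the total number of contigs
total=$(grep -c '^>' demo_contigs.fa)
echo "Total contigs: $total"    # prints: Total contigs: 3

# For a real run, eyeball the most recent contig reported in the log:
# tail -n 5 prodigal.log
```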
Make a directory in your home directory to clone the CAT Github repository.
mkdir -p ~/CAT_clone
Move into it.
cd ~/CAT_clone
Clone the repository.
git clone https://github.com/dutilh/CAT.git
CAT offers pre-built databases based on either GTDB or the NCBI non-redundant protein database. I am going to use the NCBI nr database.
Go to the following link: https://tbb.bio.uu.nl/tina/CAT_pack_prepare/
There will be .tar.gz files corresponding to the different databases. For example: 20241212_CAT_nr.tar.gz.
Copy the url for this file (or whatever database you want), and use it in the following command:
wget --no-check-certificate https://tbb.bio.uu.nl/tina/CAT_pack_prepare/20241212_CAT_nr.tar.gz
This will begin the download of the pre-built NCBI nr database. It is a large file, so it will take a while; I would recommend running the download in the background using either nohup as before, or tmux.
A tmux session essentially acts as a second screen that you can attach to and detach from whenever you like, while it keeps running in the background.
Create a new tmux session.
tmux new -s cat_download
This will open a new terminal where you can navigate to where you cloned the CAT repository and run the download.
wget --no-check-certificate https://tbb.bio.uu.nl/tina/CAT_pack_prepare/20241212_CAT_nr.tar.gz
Then use ‘Ctrl + b’ followed by the ‘d’ key alone to detach from the tmux session and return to your normal shell.
You can re-attach and check the progress of the download at any time.
tmux attach -t cat_download
Once the download is complete, extract the archive; I would also recommend doing this in the background using the same tmux session or a new one.
tar -xvzf 20241212_CAT_nr.tar.gz
Once this is complete, you can close the tmux session using ‘exit’ within the tmux window.
First, it is a good idea to make CAT available to run from any location, not just the cloned repository.
So open up your shell profile.
nano ~/.bashrc
At the bottom there may be some lines starting with ‘export’. Add the following line at the bottom, changing the path to wherever you cloned the CAT repository.
export PATH="$HOME/path/to/cloned/CAT/repository/CAT_clone/CAT_pack:$PATH"
Save this change with ‘Ctrl + o’, press ‘Enter’ to confirm, then press ‘Ctrl + x’ to exit.
Reload the shell.
source ~/.bashrc
You will now be able to run CAT from any directory.
Now we are ready to run taxonomic classification with CAT.
Move to the directory containing your Prodigal outputs along with the filtered contigs .fa file. The ‘-c’ argument must point to the same filtered contig .fa file used as the input for Prodigal. The ‘-d’ argument is the path to the ‘db’ directory inside the extracted CAT database, and ‘-t’ is the path to the ‘tax’ directory alongside it. The ‘-p’ argument is the protein .faa file output by Prodigal, and ‘-o’ sets the prefix for the output files of this run. The ‘--top’ argument controls how CAT uses not just the single best DIAMOND hit but a group of similarly good hits: the default of 50 includes all hits within 50% of the top score, and the lowest common ancestor of these hits is used to infer each protein’s taxonomy. The taxonomic hits aggregated over the contig then define the contig taxonomy. The final two arguments point the CAT command at your installed versions of prodigal and diamond.
CAT_pack contigs \
-c filtered_contigs.fa \
-d ~/path/to/CAT/download/20241212_CAT_nr_website/db \
-t ~/path/to/CAT/download/20241212_CAT_nr_website/tax \
-p contig_proteins.faa \
-o CAT_output \
--top 50 \
--path_to_prodigal $(which prodigal) \
--path_to_diamond $(which diamond)
This may take a while if you have a lot of sequences so I recommend using tmux again to run this in the background.
When complete, you will have files with the prefix ‘CAT_output’, such as ‘CAT_output.contig2classification.txt’ and ‘CAT_output.ORF2LCA.txt’. These two are the most relevant, but their taxonomic classifications are NCBI taxonomy IDs, which are not really human readable. CAT has a built-in function to add names using the tax/ directory of the pre-built database you downloaded.
If you are in the directory with the CAT output, you just have to point it at your CAT database.
CAT_pack add_names -i CAT_output.contig2classification.txt -o CAT_output.contig2classification.named.txt -t ~/CAT_clone/20241212_CAT_nr_website/tax
And then do the same for the ORF2LCA file as well.
CAT_pack add_names -i CAT_output.ORF2LCA.txt -o CAT_output.ORF2LCA.named.txt -t ~/CAT_clone/20241212_CAT_nr_website/tax
You now have informative taxonomic classifications for each of your contigs.
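As a quick sanity check, you can tally the most common lineages in the named output with standard Unix tools. The exact column positions in the CAT output vary between versions, so inspect the header line first (head -n 1) and adjust the cut field; the sketch below demonstrates the tallying pattern on a toy two-column stand-in file, not CAT's real layout.

```shell
# Toy stand-in for CAT_output.contig2classification.named.txt
# (real column positions differ; check the '# contig ...' header line first)
printf 'contig_1\tBacteria;Proteobacteria\ncontig_2\tBacteria;Firmicutes\ncontig_3\tBacteria;Proteobacteria\n' > demo_named.tsv

# Tally lineages, most common first (swap -f2 for the real lineage column)
cut -f2 demo_named.tsv | sort | uniq -c | sort -rn
```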
For a broad taxonomic profile, a less conservative approach might suit better. In this case Kraken2 is a fast and easy-to-use option. Like CAT, it requires some database setup that takes up a lot of space.
Firstly, create a new environment for kraken2.
conda create -n kraken2_env -c bioconda -c conda-forge kraken2
I like to download the most up-to-date complete version of the standard database available in the Kraken2 archives. To download and extract the archive, I recommend using a tmux session so it can run in the background. You will also need roughly 200 GB of free space.
Make a new tmux session.
tmux new -s kraken_db
Make a directory for the database and move into it (the date is the version of the db I used).
mkdir kraken2_standard_20250714
cd kraken2_standard_20250714
Download the archive.
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20250714.tar.gz
Extract the archive.
tar -xvzf k2_standard_20250714.tar.gz
If using a tmux session, you can detach at any time with ‘Ctrl + b’ followed by ‘d’, and once finished you can end the session by typing ‘exit’ in the tmux window.
Now that the database is downloaded, it is ready to use with your input contig fasta file; again, I would recommend using a filtered set of contigs.
kraken2 --db kraken2_standard_20250714 --threads 8 --output kraken2_contig_output.txt --report kraken2_contig_report.txt filtered_contigs.fa
This runs very quickly: I classified 100,000 sequences in about 20 seconds, with over 90% of contigs assigned classifications. It is also much simpler to set up and run than CAT. However, the output is a little more complex, and obtaining full lineages, which is often of interest in metagenomic studies, takes more work.
One thing that is difficult to do in post-processing is filtering based on confidence scores; it is easier to set a confidence threshold during the run with the ‘--confidence’ argument. I will use 0.2, which was found to be suitable for the standard Kraken database in the following paper: https://doi.org/10.1007/s42994-024-00178-0
kraken2 --db kraken2_standard_20250714 --threads 8 --output kraken2_contig_output.txt --report kraken2_contig_report.txt filtered_contigs.fa --confidence 0.2
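One quick check on the run is the fraction of contigs that were classified at all. In Kraken2's standard per-sequence output, column 1 is ‘C’ (classified) or ‘U’ (unclassified), column 2 is the sequence name, and column 3 is the assigned tax ID; the snippet below demonstrates the calculation on a toy stand-in file, which you would replace with kraken2_contig_output.txt.

```shell
# Toy stand-in for kraken2_contig_output.txt: col1 C/U, col2 seq id, col3 taxid
printf 'C\tcontig_1\t1224\nU\tcontig_2\t0\nC\tcontig_3\t1239\n' > demo_kraken.txt

# Fraction of contigs classified
awk '$1=="C"{c++} END{printf "classified: %d/%d (%.0f%%)\n", c, NR, 100*c/NR}' demo_kraken.txt
# prints: classified: 2/3 (67%)
```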
The output from Kraken2 is not human readable, so it takes a bit of manipulation to get to the full taxonomic lineages for each contig. We can use TaxonKit for this, which I recommend installing into the kraken2 environment you created.
conda install -c bioconda taxonkit
TaxonKit needs a copy of the NCBI taxonomy lineage information. The Kraken2 db contains some of the NCBI files from when it was built, but to get all the required files we can just download the most up-to-date NCBI taxonomy dump.
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Then extract.
tar -xvzf taxdump.tar.gz
This will unpack a number of .dmp files, but we only really need names.dmp, nodes.dmp, delnodes.dmp, and merged.dmp. By default, TaxonKit looks for a directory called ‘.taxonkit’ in your home directory containing these files, so let’s make that directory and copy them in.
mkdir -p ~/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp ~/.taxonkit/
Now with TaxonKit set up, we can extract lineages from the tax IDs assigned by Kraken2.
First extract the contig names and the assigned tax IDs:
awk '{print $2"\t"$3}' kraken2_contig_output.txt > contig_taxid.tsv
Make a unique list of tax IDs.
cut -f2 contig_taxid.tsv | sort -u > taxids.txt
Get the full lineages using taxonkit and format to standard taxonomic ranks.
taxonkit lineage taxids.txt | taxonkit reformat -F -f "{k};{K};{p};{c};{o};{f};{g};{s}" | cut -f1,3 > taxid_lineage.tsv
Sort both the contigs and their tax IDs, and the lineages.
sort -k2,2 contig_taxid.tsv > contig_taxid.sorted.tsv
sort -k1,1 taxid_lineage.tsv > taxid_lineage.sorted.tsv
Join together.
join -t $'\t' -a1 -1 2 -2 1 \
contig_taxid.sorted.tsv \
taxid_lineage.sorted.tsv \
| awk -F'\t' 'BEGIN{OFS="\t"}{print $2,$1,$3}' \
> contig_lineage.tsv
This contig_lineage.tsv should be a tab-separated file containing, for each contig, its tax ID and its full lineage.
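From here you can summarize the community at any rank by splitting the lineage string. With the reformat string used above, phylum is the third semicolon-delimited field; the sketch below demonstrates the pattern on a toy stand-in for contig_lineage.tsv.

```shell
# Toy stand-in for contig_lineage.tsv: contig, taxid, semicolon-delimited lineage
printf 'contig_1\t1224\tBacteria;Bacteria;Pseudomonadota;Gammaproteobacteria\ncontig_2\t1239\tBacteria;Bacteria;Bacillota;Bacilli\ncontig_3\t1224\tBacteria;Bacteria;Pseudomonadota;Gammaproteobacteria\n' > demo_lineage.tsv

# Count contigs per phylum (third rank in the lineage string built earlier)
cut -f3 demo_lineage.tsv | awk -F';' '{print $3}' | sort | uniq -c | sort -rn
```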