1 Introduction

Assembling reads into contigs may be enough for your analytical needs, e.g. if you just want to predict protein-coding regions or assign taxonomies to the contigs to get a broad taxonomic profile of the system.

In this case, contigs can be annotated in different ways, and I will demonstrate here the use of the Contig Annotation Tool (CAT) for a conservative taxonomic classification, and Kraken2 for a more liberal taxonomic classification.

I would recommend filtering the contigs from your assembly based on length and coverage, which requires mapping reads from your samples back to the assembly. Mapping and filtering are covered in my guide here: https://scottc-bio.github.io/guides/Shotgun-metagenomic-read-mapping.html

2 Contig annotation with CAT

CAT (von Meijenfeldt et al. 2019) is useful as it aggregates the taxonomic predictions of the multiple ORFs on a contig to build support for a contig-level taxonomic assignment. Protein-coding regions are therefore predicted as part of the CAT pipeline, and these can also be used for downstream functional analysis.

2.1 Installing CAT and Prodigal

We will create an environment suitable for both CAT and Prodigal.

conda create -n cat_env -c bioconda -c conda-forge python=3.10 diamond prodigal

Activate the environment.

conda activate cat_env

2.2 Predict genes with Prodigal

Prodigal will be used to predict the coding regions on the contigs, which CAT will use for taxonomic classification (Hyatt et al. 2010).

Navigate to the directory containing your filtered contig sequences and run the gene prediction. The arguments are:

  • ‘-i’ - the input .fa file of contig sequences
  • ‘-a’ - the output fasta of amino acid sequences
  • ‘-d’ - the output fasta of nucleotide sequences
  • ‘-o’ - the output gff file containing the gene locations
  • ‘-f’ - specifies gff as the output format, which is useful in many downstream analyses
  • ‘-p’ - specifies metagenomic mode, which is essential for contigs as it allows Prodigal to handle the fragmented nature of the sequences more effectively

prodigal -i filtered_contigs.fa -a contig_proteins.faa -d contig_genes.fna -o contig_genes.gff -f gff -p meta 

This might take some time, so you can run it in the background using nohup.

nohup prodigal -i filtered_contigs.fa -a contig_proteins.faa -d contig_genes.fna -o contig_genes.gff -f gff -p meta > prodigal.log 2>&1 &

This will run in the background and store the output in prodigal.log. Prodigal works through the contigs one by one, so the run is finished when the contig number in the log reaches the total number of contigs in your filtered assembly.
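To know what that final contig number should be, you can count the header lines in the fasta. A minimal sketch using a throwaway demo file (substitute your own filtered_contigs.fa):

```shell
# Create a tiny demo FASTA (a stand-in for filtered_contigs.fa)
cat > /tmp/demo_contigs.fa <<'EOF'
>contig_1
ATGCATGC
>contig_2
GGCCGGCC
>contig_3
TTAATTAA
EOF

# Each contig starts with a '>' header line, so counting them
# gives the total number of contigs Prodigal has to process
total=$(grep -c '^>' /tmp/demo_contigs.fa)
echo "$total"    # prints 3
```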

2.3 CAT setup

Make a directory in your home directory in which to clone the CAT GitHub repository.

mkdir -p ~/CAT_clone

Move into it.

cd ~/CAT_clone

Clone the repository.

git clone https://github.com/dutilh/CAT.git

2.4 Preparing a database

CAT offers pre-built databases based on either GTDB or the NCBI non-redundant protein database. I am going to use the NCBI nr database.

Go to the following link: https://tbb.bio.uu.nl/tina/CAT_pack_prepare/

There will be .tar.gz files corresponding to the different databases. For example: 20241212_CAT_nr.tar.gz.

Copy the URL for this file (or whichever database you want), and use it in the following command:

wget --no-check-certificate https://tbb.bio.uu.nl/tina/CAT_pack_prepare/20241212_CAT_nr.tar.gz

This will begin the download of the pre-constructed NCBI nr database. It is a big file, so it will take a while; I would recommend running the download in the background using either nohup as before, or tmux.

tmux essentially acts as a second screen that keeps running in the background and can be attached to and detached from whenever you like.

Create a new tmux session.

tmux new -s cat_download

This will open a new terminal, where you can navigate to the directory where you cloned the Git repository and run the download.

wget --no-check-certificate https://tbb.bio.uu.nl/tina/CAT_pack_prepare/20241212_CAT_nr.tar.gz

Then use ‘Ctrl + b’ followed by the ‘d’ key alone to detach from the tmux session and return to your normal shell.

You can re-attach and check the progress of the download at any time.

tmux attach -t cat_download

Once the download is complete, you can extract the archive. I would recommend also doing this in the background, using the same tmux session or a new one.

tar -xvzf 20241212_CAT_nr.tar.gz

Once this is complete, you can close the tmux session using ‘exit’ within the tmux window.

2.5 Making CAT available anywhere

It is a good idea to make CAT available to run from any location, not just from within the cloned repository.

Open up your shell profile.

nano ~/.bashrc

There may be some lines starting with ‘export’ at the bottom. Add a line like the following to the bottom of the file, changing the path to where you cloned your CAT repository.

export PATH="$HOME/path/to/cloned/CAT/repository/CAT_clone/CAT_pack:$PATH"

Save this change with ‘Ctrl + o’, press ‘Enter’ to confirm, then press ‘Ctrl + x’ to exit.

Reload the shell.

source ~/.bashrc

Now you will be able to run CAT from any directory.
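If you want to see what the export line is doing, here is a self-contained sketch of the same mechanism using a throwaway script in /tmp (the directory and script names are just for illustration):

```shell
# Make a directory holding an executable script
mkdir -p /tmp/demo_bin
printf '#!/bin/sh\necho hello\n' > /tmp/demo_bin/mytool
chmod +x /tmp/demo_bin/mytool

# Prepend the directory to PATH, exactly as done for CAT_pack above
export PATH="/tmp/demo_bin:$PATH"

# The script can now be run by name from any working directory
mytool    # prints: hello
```

Because the directory is prepended rather than appended, a script in it would also shadow any same-named command elsewhere on your PATH.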

2.6 Running CAT for taxonomic assignment

Now we are ready to run taxonomic classification with CAT.

Move to the directory where your Prodigal outputs are located, along with the filtered contigs .fa file. The arguments are:

  • ‘-c’ - the same filtered contig .fa file used as the input for Prodigal
  • ‘-d’ - the path to the ‘db’ directory in the extracted CAT database download
  • ‘-t’ - the path to the ‘tax’ directory in the extracted CAT database download
  • ‘-p’ - the protein sequence .faa file output by Prodigal
  • ‘-o’ - the prefix for the output files of this process
  • ‘--top’ - determines how CAT uses not just the single best DIAMOND hit but a group of similarly good hits; the default of 50 includes all hits within 50% of the top score and uses the lowest common ancestor of these hits to infer each protein’s taxonomy. The taxonomic hits aggregated over the contig then define the contig taxonomy.
  • ‘--path_to_prodigal’ and ‘--path_to_diamond’ - direct CAT to your installed versions of Prodigal and DIAMOND

CAT_pack contigs \
  -c filtered_contigs.fa \
  -d ~/path/to/CAT/download/20241212_CAT_nr_website/db \
  -t ~/path/to/CAT/download/20241212_CAT_nr_website/tax \
  -p contig_proteins.faa \
  -o CAT_output \
  --top 50 \
  --path_to_prodigal $(which prodigal) \
  --path_to_diamond $(which diamond)

This may take a while if you have a lot of sequences, so I recommend using tmux again to run it in the background.

When complete, you will have files with the prefix ‘CAT_output’, such as:

  • CAT_output.contig2classification.txt - The main results file with taxonomies assigned to each contig
  • CAT_output.ORF2LCA.txt - Maps each predicted ORF to its Lowest Common Ancestor (LCA) taxon. Could be useful if you are interested in specific ORFs.
  • CAT_output.alignment.diamond - The DIAMOND alignment output of ORFs against the NCBI nr database. Can be reused by CAT in the future to skip the alignment stage.
  • CAT_output.log - The .log file of the entire run
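For a quick summary of how many contigs received a classification, you can count the rows flagged as classified in the contig2classification file. The mock-up below is only an illustration of the layout (tab-separated, with the classification in the second column); check the header line of your own file before relying on these column positions:

```shell
# Build a mock contig2classification file with printf to guarantee real tabs
printf '# contig\tclassification\treason\tlineage\n' > /tmp/demo_cat.txt
printf 'contig_1\ttaxid assigned\tbased on 5/6 ORFs\t1;131567;2\n' >> /tmp/demo_cat.txt
printf 'contig_2\tno taxid assigned\tno ORFs found\t\n' >> /tmp/demo_cat.txt

# Count rows where the classification column says a taxid was assigned
n_classified=$(awk -F'\t' '$2 == "taxid assigned" {n++} END {print n+0}' /tmp/demo_cat.txt)
echo "$n_classified"    # prints 1
```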

2.7 Adding names to the classifications

The most relevant files are ‘contig2classification’ and ‘ORF2LCA’, but the taxonomic classifications in them are NCBI taxonomy IDs, which are not really human readable. CAT has a built-in function to add names using the tax/ directory of the pre-constructed database you downloaded.

If you are in the directory with the CAT output, then you just have to point it to your CAT database.

CAT_pack add_names -i CAT_output.contig2classification.txt -o CAT_output.contig2classification.named.txt  -t ~/CAT_clone/20241212_CAT_nr_website/tax

And then do the same for the ORF2LCA file as well.

CAT_pack add_names -i CAT_output.ORF2LCA.txt -o CAT_output.ORF2LCA.named.txt  -t ~/CAT_clone/20241212_CAT_nr_website/tax

You now have informative taxonomic classifications for each of your contigs.

3 Contig annotation with Kraken2

For a broad taxonomic profile, a less conservative approach might suit better. In this case, Kraken2 is a fast and easy-to-use option. Like CAT, though, it requires some database setup that takes up a lot of space.

3.1 Installing Kraken2

First, create a new environment for Kraken2.

conda create -n kraken2_env -c bioconda -c conda-forge kraken2

Then activate it.

conda activate kraken2_env

3.2 Set up the Kraken2 database

I like to download the most up-to-date complete version of the database available in the Kraken2 archives. To download and extract the archive, I recommend using a tmux session so it can run in the background. You will also need around 200 GB of free space.

Make a new tmux session.

tmux new -s kraken_db

Make a directory for the database and move into it (the date is the version of the db I used).

mkdir kraken2_standard_20250714
cd kraken2_standard_20250714

Download the archive.

wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20250714.tar.gz

Extract the archive.

tar -xvzf k2_standard_20250714.tar.gz

If using a tmux session, you can detach at any time using ‘Ctrl + b’ followed by ‘d’. Once finished, you can end the session by typing ‘exit’ in the tmux window.

3.3 Running Kraken2

Now that the database is downloaded, it is ready to use with your input contig fasta file. Again, I would recommend using a filtered set of contigs.

kraken2 --db kraken2_standard_20250714 --threads 8 --output kraken2_contig_output.txt --report kraken2_contig_report.txt filtered_contigs.fa

This will run very quickly: I classified 100,000 sequences in 20 seconds, with over 90% of contigs assigned a classification. It is also much simpler to set up and run than CAT. However, the output is a little more complex, and obtaining full lineages, which is often of interest in metagenomic studies, is more difficult.

One thing that is difficult to do in post-processing is filtering based on confidence scores. It is easier to set the confidence threshold during the run with the ‘--confidence’ argument. I will use 0.2, which was found to be suitable when using the standard Kraken2 database in the following paper: https://doi.org/10.1007/s42994-024-00178-0

kraken2 --db kraken2_standard_20250714 --threads 8 --output kraken2_contig_output.txt --report kraken2_contig_report.txt filtered_contigs.fa --confidence 0.2
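The per-sequence output file has five tab-separated columns: C/U (classified or unclassified), the sequence name, the assigned tax ID, the sequence length, and the k-mer-to-LCA mappings. A small mock-up (the contig names and tax IDs are made up) showing how to calculate the classified fraction:

```shell
# Mock Kraken2 per-sequence output; columns: C/U, name, taxid, length, k-mer LCA map
printf 'C\tcontig_1\t562\t1500\t562:10 561:5\n'  >  /tmp/demo_kraken.txt
printf 'U\tcontig_2\t0\t900\t0:30\n'             >> /tmp/demo_kraken.txt
printf 'C\tcontig_3\t1280\t2100\t1280:40\n'      >> /tmp/demo_kraken.txt

# Count classified (C) records and report them as a percentage of all records
pct=$(awk -F'\t' '$1 == "C" {c++} END {printf "%.0f", 100 * c / NR}' /tmp/demo_kraken.txt)
echo "${pct}%"    # prints 67%
```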

3.4 Setting up TaxonKit

The output from Kraken2 is not human readable, so it takes a bit of manipulation to get to the full taxonomic lineages for each contig. We can use TaxonKit for this, which I recommend installing into the Kraken2 environment you created.

conda install -c bioconda taxonkit

TaxonKit needs a copy of the NCBI taxonomy lineage information. The Kraken2 db contains some of the NCBI files from when the db was created, but to get all the files we can just download the most up-to-date NCBI taxonomy.

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Then extract.

tar -xvzf taxdump.tar.gz

This will unpack a number of .dmp files, but we only really need names.dmp, nodes.dmp, delnodes.dmp, and merged.dmp. By default, TaxonKit looks for these files in a hidden directory called ‘.taxonkit’ in the home directory, so let’s make that directory and copy the files into it.

mkdir -p ~/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp ~/.taxonkit/

3.5 Reaching taxonomic lineages for contigs

Now with TaxonKit set up, we can begin to extract lineages from the tax IDs assigned by Kraken2.

First, extract the contig names and the assigned tax IDs:

awk '{print $2"\t"$3}' kraken2_contig_output.txt > contig_taxid.tsv

Make a unique list of tax IDs.

cut -f2 contig_taxid.tsv | sort -u > taxids.txt

Get the full lineages using taxonkit and format to standard taxonomic ranks.

taxonkit lineage taxids.txt \
  | taxonkit reformat -F -f "{k};{K};{p};{c};{o};{f};{g};{s}" \
  | cut -f1,3 \
  > taxid_lineage.tsv

Sort both the contigs and their tax IDs, and the lineages.

sort -k2,2 contig_taxid.tsv > contig_taxid.sorted.tsv
sort -k1,1 taxid_lineage.tsv > taxid_lineage.sorted.tsv

Join together.

join -t $'\t' -a1 -1 2 -2 1 \
  contig_taxid.sorted.tsv \
  taxid_lineage.sorted.tsv \
| awk -F'\t' 'BEGIN{OFS="\t"}{print $2,$1,$3}' \
> contig_lineage.tsv

This contig_lineage.tsv should be a tab-separated file containing, for each contig, its tax ID and the full lineage.
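To sanity-check the sort/join/awk logic, here is the same pipeline run on two toy files (the contig names, tax IDs, and lineages are made up):

```shell
# Toy inputs mirroring contig_taxid.tsv and taxid_lineage.tsv
printf 'contig_1\t562\ncontig_2\t1280\n' > /tmp/ct.tsv
printf '562\tBacteria;Pseudomonadota;Escherichia coli\n1280\tBacteria;Bacillota;Staphylococcus aureus\n' > /tmp/tl.tsv

# Sort each file on its tax ID column, as join requires sorted input
sort -k2,2 /tmp/ct.tsv > /tmp/ct.sorted.tsv
sort -k1,1 /tmp/tl.tsv > /tmp/tl.sorted.tsv

# Join on tax ID, then reorder the columns to contig, tax ID, lineage
join -t "$(printf '\t')" -a1 -1 2 -2 1 /tmp/ct.sorted.tsv /tmp/tl.sorted.tsv \
| awk -F'\t' 'BEGIN{OFS="\t"}{print $2, $1, $3}' \
> /tmp/demo_contig_lineage.tsv

cat /tmp/demo_contig_lineage.tsv
```

Note that the output comes back ordered by tax ID (lexically), not by contig name, because the join ran on the taxid-sorted files; re-sort on column 1 afterwards if you want contig order.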

References

Hyatt, Doug, Gwo-Liang Chen, Philip F. LoCascio, Miriam L. Land, Frank W. Larimer, and Loren J. Hauser. 2010. “Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification.” BMC Bioinformatics 11 (1): 119. doi:10.1186/1471-2105-11-119.
von Meijenfeldt, F. A. Bastiaan, Ksenia Arkhipova, Diego D. Cambuy, Felipe H. Coutinho, and Bas E. Dutilh. 2019. “Robust Taxonomic Classification of Uncharted Microbial Sequences and Bins with CAT and BAT.” Genome Biology 20 (1): 217. doi:10.1186/s13059-019-1817-x.