This guide will cover the download, installation, and local use of GTDB-Tk (Chaumeil et al. 2020) for the classification of MAGs.

A Linux-based system with Miniconda is required for this guide, and substantial computing power is recommended. Such a system might exist at your institution, or can be purchased through an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html

A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.

If you have followed my guide on binning and filtering to obtain MAGs (https://scottc-bio.github.io/guides/Shotgun-metagenomic-contig-binning.html), then you should have a directory containing .fa files for each bin which represent the sequences of your MAGs. We will use these for taxonomic classification.

1 Installing GTDB-Tk

The developers have made installing GTDB-Tk and its associated database very simple with conda. The following command creates an environment with the most recent version of gtdbtk at the time of writing. You can find the most up-to-date version on the website (https://ecogenomics.github.io/GTDBTk/installing/bioconda.html) and adjust the version number accordingly.

conda create -n gtdbtk_env -c conda-forge -c bioconda gtdbtk=2.6.1 tqdm -y
conda activate gtdbtk_env

1.1 Utilising the download shell script

Installing the conda package makes a download script, download-db.sh, available inside the environment. It downloads the most recent version of the database, extracts it, and sets the data path for you, which can be quite useful.

This might take over an hour, so a ‘tmux’ session could be used to run this in the background.

A tmux session essentially acts as a second terminal that you can attach to and detach from whenever you like, while anything started inside it keeps running in the background.

Create a new tmux session.

tmux new -s gtdbtk_download

This will open a new terminal where you can run the download script. You will need to reactivate the conda environment.

download-db.sh

Use ‘Ctrl + b’ followed by the ‘d’ key to detach from the tmux session and return to your normal shell.

You can re-attach and check the progress of the download at any time.

tmux attach -t gtdbtk_download

Once complete, I would recommend running the taxonomic classification in this tmux session as well, as it is a long process.

1.2 Manual download of the database

The issue with the automatic script is that the database will only be available in the conda environment. To make it available system-wide, I prefer a manual install. I also find this easier to manage and debug.

As the file is quite large, we will download it with aria2c, which speeds up the transfer by splitting it across multiple connections.

Install it into the conda environment.

conda install -c conda-forge aria2 -y

Make a directory for the download somewhere accessible e.g. near your home directory, and then move into it.

mkdir -p ~/databases/gtdbtk/
cd ~/databases/gtdbtk

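Then download the database over multiple connections. Here ‘-x 16’ and ‘-s 16’ allow up to 16 connections, and ‘-c’ resumes a partially completed download.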

aria2c -x 16 -s 16 -c https://data.gtdb.aau.ecogenomic.org/releases/release226/226.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r226_data.tar.gz

Verify the download by checking the size. It should be ~132 GB.

ls -lh gtdbtk_r226_data.tar.gz
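
The size check can be complemented with an integrity test of the compressed stream, which catches truncated downloads. The following is a minimal sketch rather than anything provided by GTDB-Tk: the function name check_archive and the size threshold in the example call are my own assumptions.

```shell
# Hypothetical helper: fail if the archive is missing, smaller than expected,
# or if its gzip stream is corrupt or truncated.
check_archive() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  size=$(wc -c < "$file")   # file size in bytes
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too small: $size bytes (expected at least $min_bytes)" >&2
    return 1
  fi
  # gzip -t reads the whole archive, so this is slow on a very large file
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip stream: $file" >&2; return 1; }
  echo "OK: $file ($size bytes)"
}

# Example call (the threshold is an assumed lower bound, adjust for your release):
# check_archive gtdbtk_r226_data.tar.gz 100000000000
```

If the check fails, re-running the aria2c command with ‘-c’ should resume the download rather than start it again.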

Extract the database.

tar -xzf gtdbtk_r226_data.tar.gz

This will unpack a directory called ‘release226’ (or whatever release number you downloaded) within your ‘gtdbtk/’ directory.

Once complete, set the database path inside the conda environment so that GTDB-Tk can find it.

conda env config vars set GTDBTK_DATA_PATH="$HOME/databases/gtdbtk/release226"

To make these changes take effect, deactivate and re-activate the environment.

conda deactivate
conda activate gtdbtk_env
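
A mistyped path is the easiest thing to get wrong at this step, so it can be worth confirming that the variable is set and points at a real directory. This is a small sketch; check_db_path is a hypothetical helper name, not a GTDB-Tk command.

```shell
# Hypothetical helper: confirm GTDBTK_DATA_PATH is set and points at a directory.
check_db_path() {
  if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
    echo "GTDBTK_DATA_PATH is not set" >&2
    return 1
  fi
  if [ ! -d "$GTDBTK_DATA_PATH" ]; then
    echo "not a directory: $GTDBTK_DATA_PATH" >&2
    return 1
  fi
  echo "GTDBTK_DATA_PATH=$GTDBTK_DATA_PATH"
}

# Example call, run inside the re-activated environment:
# check_db_path
```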

You can then check the install using the built-in command.

gtdbtk check_install

This will run through all the downloaded files and check their integrity. If everything comes back with a green ‘OK’, then you are ready to go.

2 Running GTDB-Tk classification of MAGs

Move to the directory above your MAGs, which should be in a directory with .fa files for each bin. They may also be in .fasta format, depending on the binning tool you used.
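
If you are unsure which extension your bins use, a quick tally of the file extensions in the directory settles it. A minimal sketch, assuming a find that supports -maxdepth (as on most Linux systems); count_extensions is a hypothetical helper name.

```shell
# Hypothetical helper: count files per extension in a bin directory,
# to pick the right value for --extension.
count_extensions() {
  local dir="$1"
  # List regular files with a dot in the name, keep the text after the last dot,
  # then print "extension<TAB>count" for each extension found.
  find "$dir" -maxdepth 1 -type f -name '*.*' \
    | sed 's/.*\.//' \
    | sort \
    | uniq -c \
    | awk '{print $2 "\t" $1}'
}

# Example call:
# count_extensions bins/
```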

The classification can now be run with a single command, using the following arguments:

  • ‘--genome_dir’ - The path to the directory containing your bins (MAGs).
  • ‘--out_dir’ - The name of the output directory, which will be created and populated with the output files.
  • ‘--cpus 16’ - The number of CPUs allocated to the process.
  • ‘--skip_ani_screen’ - Skips the initial ANI screen, in which gtdbtk compares each input genome against the GTDB reference genomes and classifies close matches before any tree placement. This screen is not a quality filter or a dereplication step. I skip it here. In versions that use Mash for the screen, it is run by replacing this argument with ‘--mash_db’ followed by a path where the Mash reference sketch is written or read; check ‘gtdbtk classify_wf --help’ for the options available in your version.
  • ‘--extension’ - The file extension of the bins, here .fa. By default gtdbtk looks for files ending in .fna.

gtdbtk classify_wf \
  --genome_dir bins/ \
  --out_dir gtdbtk_output \
  --cpus 16 \
  --skip_ani_screen \
  --extension fa

This does not take too long to run: for 25 bins it took me under 2 hours with 16 CPUs. It can also be run in the background using tmux.

3 Understanding the GTDB-Tk output

The output directory from gtdbtk will have the following:

  • align/ - Contains the multiple sequence alignments of marker genes from MAGs and the reference genomes in the database. Mainly for internal use.
  • classify/ - The internal classification results files for each bin. Not really in human readable format.
  • identify/ - The output of marker gene identification, also used internally and not for human reading.
  • gtdbtk.json - A structured summary of the entire run, including parameters and system metadata. Mainly useful for provenance and reproducibility.
  • gtdbtk.log - Output from the run, useful for debugging if something goes wrong.
  • gtdbtk.warnings.log - Warnings like misalignments, low ANI scores. Worth checking. Hopefully empty.
  • gtdbtk.bac120.summary.tsv - The key output file, where each row is a bin with its taxonomic classification results. Archaeal genomes, if any, are reported in a separate gtdbtk.ar53.summary.tsv.

The .tsv file is of the most importance and has the following columns:

  • user_genome - Name of input MAG (e.g. bin.1.fa)
  • classification - Final GTDB taxonomy assigned to the genome
  • fastani_reference - GTDB reference genome with the highest FastANI match to your MAG
  • fastani_reference_radius - The ANI-based species radius threshold for the FastANI reference. Used to decide whether the MAG falls within the reference species
  • fastani_taxonomy - The GTDB taxonomy of the FastANI reference genome
  • fastani_ani - ANI (Average Nucleotide Identity) score between your MAG and the best-matching GTDB genome. An ANI above 95 % suggests the same species
  • fastani_af - Alignment fraction, the proportion of the MAG that could be aligned to the reference
  • closest_placement_reference - The GTDB reference genome that your MAG is closest to based on phylogenetic placement
  • closest_placement_radius - The ANI radius of the closest placement reference genome
  • closest_placement_taxonomy - GTDB taxonomy of the closest phylogenetic reference
  • closest_placement_ani - ANI between your MAG and the closest phylogenetic placement genome
  • closest_placement_af - Alignment fraction between your MAG and the phylogenetic placement reference
  • pplacer_taxonomy - Taxonomy based on marker gene phylogenetic placement using pplacer
  • classification_method - Whether GTDB-Tk used ANI, pplacer, or both for the final taxonomy
  • note - Notes about the classification, e.g. if classification relied only on pplacer, or if the ANI match was too low
  • other_related_references(genome_id,species_name,radius,ANI,AF) - A list of other GTDB genomes closely related to your MAG with ANI/AF details
  • msa_percent - Percentage of the marker gene multiple sequence alignment (MSA) covered by your MAG, higher is better
  • translation_table - The genetic code table used for annotation
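  • red_value - Relative Evolutionary Divergence (RED), the relative position of your MAG’s placement between the root (0) and the tips (1) of the reference tree. Values closer to 1 indicate placement near existing lineages; lower values indicate a deeper, more divergent placement.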
  • warnings - Any warnings e.g. low alignment, short contigs.
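
With this many columns the table is awkward to read in a terminal, so it helps to pull out a few columns by name. The sketch below looks columns up in the header rather than by position, so it reports an error if a name listed above is absent from your version's output; select_columns is a hypothetical helper name.

```shell
# Hypothetical helper: print chosen columns of a tab-separated file by header name.
# Usage: select_columns FILE col1,col2,...
select_columns() {
  awk -F '\t' -v OFS='\t' -v cols="$2" '
    NR == 1 {
      n = split(cols, want, ",")
      for (i = 1; i <= NF; i++) idx[$i] = i      # map header name -> column number
      for (j = 1; j <= n; j++)
        if (!(want[j] in idx)) {
          print "no such column: " want[j] > "/dev/stderr"
          exit 1
        }
    }
    {
      # Build the output row from the requested columns, in the order given
      line = $(idx[want[1]])
      for (j = 2; j <= n; j++) line = line OFS $(idx[want[j]])
      print line
    }
  ' "$1"
}

# Example call:
# select_columns gtdbtk_output/gtdbtk.bac120.summary.tsv user_genome,classification,msa_percent
```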

4 Notes on post-hoc filtering of GTDB-Tk results

Generally, the GTDB-Tk classification system seems fairly stringent, but some tree placements can still be low-confidence.

In outline, this is the workflow for GTDB-Tk:

  1. Use FastANI to identify reference genomes in the database with over 95 % average nucleotide identity (ANI) and over 65 % alignment fraction (AF) to the MAG.
  2. If this fails, fall back to pplacer: a set of 120 bacterial marker genes is identified in the MAG and aligned, and this alignment is used to place the MAG in the reference genome tree.

Therefore, I personally use the following filtering logic: “If a FastANI reference is not identified, and an msa_percent of less than 50 % is achieved for tree placement, then discard that bin’s taxonomic classification.”
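
This rule can be applied to the summary table with awk. The following is a sketch under stated assumptions rather than a GTDB-Tk feature: filter_summary is a hypothetical helper, it assumes missing values appear as ‘N/A’ (or ‘NA’, or empty), and it treats a missing msa_percent as failing the threshold. The reference column name defaults to fastani_reference as listed above and can be overridden if your output names it differently.

```shell
# Hypothetical helper: drop rows with no ANI reference AND msa_percent below a threshold.
# Usage: filter_summary SUMMARY.tsv [ani_reference_column] [min_msa_percent]
filter_summary() {
  awk -F '\t' -v OFS='\t' -v refcol="${2:-fastani_reference}" -v min="${3:-50}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) idx[$i] = i      # map header name -> column number
      if (!(refcol in idx) || !("msa_percent" in idx)) {
        print "expected columns not found in header" > "/dev/stderr"
        exit 1
      }
      print; next                                # always keep the header row
    }
    {
      ref = $(idx[refcol]); msa = $(idx["msa_percent"])
      no_ref = (ref == "" || ref == "N/A" || ref == "NA")
      # Assumption: a missing msa_percent counts as below the threshold
      low_msa = (msa == "" || msa == "N/A" || msa == "NA" || msa + 0 < min + 0)
      if (no_ref && low_msa) next                # discard this classification
      print
    }
  ' "$1"
}

# Example call:
# filter_summary gtdbtk_output/gtdbtk.bac120.summary.tsv > gtdbtk_filtered.tsv
```

Bins with an ANI reference are kept regardless of msa_percent, since the rule only targets classifications that rest on tree placement alone.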

References

Chaumeil, Pierre-Alain, Aaron J Mussig, Philip Hugenholtz, and Donovan H Parks. 2020. “GTDB-Tk: A Toolkit to Classify Genomes with the Genome Taxonomy Database.” Bioinformatics 36 (6): 1925–27. doi:10.1093/bioinformatics/btz848.