This guide will cover the download, installation, and local use of GTDB-Tk (Chaumeil et al. 2020) for the classification of MAGs.
A Linux-based system with Miniconda is required for this guide, and fairly high computing power is recommended. Such a system may be available through your institution or can be rented from an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html
A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.
If you have followed my guide on binning and filtering to obtain MAGs (https://scottc-bio.github.io/guides/Shotgun-metagenomic-contig-binning.html), then you should have a directory containing .fa files for each bin which represent the sequences of your MAGs. We will use these for taxonomic classification.
The developers have made installing GTDB-Tk and its associated database very simple with conda. The following command creates an environment with the most recent version of GTDB-Tk at the time of writing, but you can find the most up-to-date version by visiting the website (https://ecogenomics.github.io/GTDBTk/installing/bioconda.html) and adjusting accordingly.
conda create -n gtdbtk_env -c conda-forge -c bioconda gtdbtk=2.6.1 tqdm -y
conda activate gtdbtk_env
Installing via conda also makes a download script available that downloads the most recent version of the database, extracts it, and sets the path correctly. This can be quite useful.
This might take over an hour, so a ‘tmux’ session could be used to run this in the background.
This essentially acts as a second screen that can be attached to and detached from whenever you like but will run in the background.
Create a new tmux session.
tmux new -s gtdbtk_download
This will open a new terminal where you can run the download script. You will need to reactivate the conda environment first.
conda activate gtdbtk_env
download-db.sh
Use ‘Ctrl + b’ followed by the ‘d’ key to detach from the tmux session and return to your normal shell.
You can re-attach and check the progress of the download at any time.
tmux attach -t gtdbtk_download
Once complete, I would recommend running the taxonomic classification in this tmux session as well, as it is a long process.
The issue with the automatic script is that the database path is only set within the conda environment. To make the database available system-wide, I prefer a manual install, which I also find easier to manage and debug.
As the file is quite large, we will download it with aria2c, which splits the download across multiple parallel connections. Install this into the conda environment.
conda install -c conda-forge aria2 -y
Make a directory for the download somewhere accessible, e.g. under your home directory, and then move into it.
mkdir -p ~/databases/gtdbtk/
cd ~/databases/gtdbtk
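Before downloading, it is worth checking that the disk has enough free space; the archive is ~132 GB and the extracted database needs roughly the same again.

```shell
# Check free space on the filesystem holding the current directory;
# you need room for the ~132 GB archive plus the extracted database.
df -h .
```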
Then download the database with multi-threading.
aria2c -x 16 -s 16 -c https://data.gtdb.aau.ecogenomic.org/releases/release226/226.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r226_data.tar.gz
Verify the download by checking the size; it should be ~132 GB.
ls -lh gtdbtk_r226_data.tar.gz
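As an extra integrity check beyond the file size, you can compute the archive’s MD5 digest and compare it manually against the checksum published on the GTDB downloads page, assuming one is listed for this release:

```shell
# Print the MD5 digest of the archive (skipped if the file is absent);
# compare it by eye against the checksum on the GTDB downloads page.
if [ -f gtdbtk_r226_data.tar.gz ]; then
    md5sum gtdbtk_r226_data.tar.gz
fi
```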
Extract the database.
tar -xzf gtdbtk_r226_data.tar.gz
This will unpack a directory within your ‘gtdbtk/’ directory called ‘release226’, or whatever release number you downloaded.
Once extraction is complete, tell the conda environment where the database lives by setting an environment variable (note the directory is ‘release226’, matching what was extracted).
conda env config vars set GTDBTK_DATA_PATH="$HOME/databases/gtdbtk/release226"
To make these changes take effect, deactivate and re-activate the environment.
conda deactivate
conda activate gtdbtk_env
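If you also want the variable available outside the conda environment (for example for scripts run without activating it), a sketch of the alternative is to export it from your shell profile; the path assumes the directory layout used above:

```shell
# Append the export to ~/.bashrc so every new shell sees it, then
# reload the profile in the current shell. Adjust the path if you
# extracted the database elsewhere.
echo 'export GTDBTK_DATA_PATH="$HOME/databases/gtdbtk/release226"' >> ~/.bashrc
source ~/.bashrc
```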
You can then check the install using the built-in command.
gtdbtk check_install
This will run through all the downloaded files and check the integrity. If everything comes back with green ‘OK’ then you are ready to go.
Move to the directory above your MAGs, which should be in a directory containing a .fa file for each bin. They may also be in .fasta format depending on the binning tool you used.
Now the classification can be run with a single command. The ‘--genome_dir’ argument is the path to the directory containing your bins (MAGs). The ‘--out_dir’ argument names the output directory that will be created and populated with output files. The ‘--cpus 16’ argument is the number of CPUs allocated to the process. The ‘--skip_ani_screen’ flag tells gtdbtk to skip the pre-classification ANI screening step; I use it because I have already filtered my bins to high quality, but if you haven’t, you can replace it with the ‘--mash_db’ argument, which enables the screen using a Mash database of the reference genomes. Finally, the ‘--extension’ argument specifies the file extension of the bins as .fa, as gtdbtk assumes .fasta by default.
gtdbtk classify_wf \
--genome_dir bins/ \
--out_dir gtdbtk_output \
--cpus 16 \
--skip_ani_screen \
--extension fa
This does not take too long to run; for 25 bins it took me under 2 hours with 16 CPUs. It can also be run in the background using tmux.
The output directory from gtdbtk will contain several files and subdirectories; the key outputs are the summary files ‘gtdbtk.bac120.summary.tsv’ (bacterial bins) and ‘gtdbtk.ar53.summary.tsv’ (archaeal bins).
The summary .tsv file is of the most importance; its columns include the bin name, the full GTDB classification, the closest reference genome and its ANI, and the percentage of the multiple sequence alignment used for tree placement (msa_percent).
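For a quick overview of the assignments, you can pull just the bin name and classification columns out of the summary file. The filename below assumes the default classify_wf output location; archaeal bins, if any, appear in ‘gtdbtk.ar53.summary.tsv’.

```shell
# List each bin with its full GTDB lineage; columns 1 and 2 of the
# summary file are the bin name and the classification string.
if [ -f gtdbtk_output/gtdbtk.bac120.summary.tsv ]; then
    cut -f1,2 gtdbtk_output/gtdbtk.bac120.summary.tsv
fi
```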
Generally the GTDB-Tk classification system seems pretty stringent, but there can be some low-confidence tree placements.
Basically, the GTDB-Tk classify_wf workflow is: identify marker genes in each bin (gene calling with Prodigal and marker detection with HMMER), align the markers into a multiple sequence alignment, then classify by placing each bin into the GTDB reference tree with pplacer, comparing it to reference genomes by ANI, and assigning ranks using relative evolutionary divergence (RED).
Therefore, I personally use the following filtering logic: “If a FastANI reference is not identified, and an msa_percent of less than 50% is achieved for tree placement, then discard that bin’s taxonomic classification.”
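That rule can be sketched as a small filter over the summary file. The column names ‘fastani_reference’ and ‘msa_percent’, and the ‘N/A’ placeholder for a missing reference, are assumptions here; check your summary file’s header (newer GTDB-Tk versions name the reference column differently, e.g. ‘closest_genome_reference’) and adjust.

```shell
# Hedged sketch of the filtering rule above: keep a bin's row unless it
# has no ANI reference hit AND msa_percent < 50. Column names and the
# 'N/A' placeholder are assumptions -- check your summary file's header.
filter_summary() {
    awk -F'\t' '
        NR == 1 {
            # Locate the assumed columns by name in the header row.
            for (i = 1; i <= NF; i++) {
                if ($i == "fastani_reference") ref = i
                if ($i == "msa_percent")       msa = i
            }
            print; next
        }
        # Print every row except those failing both criteria.
        !($ref == "N/A" && $msa + 0 < 50)
    ' "$1"
}
# Usage:
# filter_summary gtdbtk_output/gtdbtk.bac120.summary.tsv > filtered_summary.tsv
```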