This guide covers the download, installation, and local use of Augustus 3.5.0 (Hoff and Stanke 2019) for the prediction of ORFs from eukaryotic sequences.
A Linux-based system with Miniconda is required for this guide, and a machine with plenty of computing power is recommended. Such a system might be available at your institution or can be purchased through an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html
A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.
Create a conda environment and install augustus.
conda create -n augustus_env -c bioconda -c conda-forge augustus=3.5.0
conda activate augustus_env
Check augustus is installed in the conda path.
which augustus
This should return something like “/home/<user>/miniconda3/envs/augustus_env/bin/augustus”, where <user> is your own username.
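If `which` prints nothing, the conda environment is probably not active. Below is a minimal sketch of a slightly more defensive check, assuming a POSIX shell. `require_tool` is a hypothetical helper name of my own, not part of Augustus or conda.

```shell
# require_tool is a hypothetical helper: it succeeds if the named command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1 (is the conda environment active?)" >&2
        return 1
    fi
}

# Report whether augustus is available without aborting the shell session.
require_tool augustus || true
```

The same helper can be reused later for any other tool this guide relies on.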
Make a directory somewhere appropriate to store the files used by the augustus tool.
mkdir -p $HOME/tools/augustus/config
Copy the augustus configuration files to this new directory.
cp -r $CONDA_PREFIX/config/* $HOME/tools/augustus/config/
Next, the path to the augustus files needs to be made available in every shell session. To do this, open up the .bashrc.
nano ~/.bashrc
Scroll down to any existing export lines, or simply to the end of the file, and add this line.
export AUGUSTUS_CONFIG_PATH=$HOME/tools/augustus/config
Use Ctrl+X, then ‘y’, then Enter to save and exit.
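If you would rather not edit the file by hand, the same line can be appended from the command line. This is only a sketch: `append_once` is a hypothetical helper name, and it assumes your shell reads ~/.bashrc on startup.

```shell
# append_once is a hypothetical helper: it appends LINE to FILE only if an
# identical line is not already present, so re-running it is harmless.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $HOME unexpanded in the file, as in the line above):
# append_once 'export AUGUSTUS_CONFIG_PATH=$HOME/tools/augustus/config' ~/.bashrc
```

The whole-line match means the export line is never duplicated, even if you repeat the setup.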
Reload the .bashrc in the current shell.
source ~/.bashrc
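Before moving on, it is worth confirming that the variable points at a usable copy of the configuration files. The sketch below assumes only that an Augustus config directory contains a species/ subfolder. `check_augustus_config` is a hypothetical helper name.

```shell
# check_augustus_config is a hypothetical helper: it verifies that a directory
# looks like an Augustus config directory, i.e. it contains a species/ subfolder.
check_augustus_config() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$dir/species" ]; then
        echo "no species/ directory under $dir" >&2
        return 1
    fi
    echo "config looks OK: $dir"
}

# Check the value exported in .bashrc; do not abort the session if it fails.
check_augustus_config "${AUGUSTUS_CONFIG_PATH:-}" || true
```

If this reports a problem, re-check the copy step and the export line before running augustus.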
Augustus needs a species model to be selected which will be used as the basis for finding genes on the contigs.
You can list all the available species models with:
augustus --species=help
There are many species to choose from. If you have a general mixed community metagenome, you will have to pick something to represent your community (or run augustus multiple times with different species models). For example, if you are interested in the fungal community of a soil sample then Aspergillus nidulans might be suitable as a proxy. The advantage of Augustus is that it is lightweight because it already has these pre-trained models. But for the same reason it will not be completely accurate in gene prediction and will vary with the species model picked. This is important to consider if you want to be more specific, e.g. predicting genes from contigs of a single genome.
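If you decide to run augustus several times with different species models, a small loop saves retyping. The sketch below only prints the commands (a dry run) so you can inspect them first. `augustus_commands` is a hypothetical helper, and the species names in the usage comment are examples that should be checked against the output of augustus --species=help.

```shell
# augustus_commands is a hypothetical helper: it prints one augustus command
# per species model for the given input FASTA, without running anything.
augustus_commands() {
    input="$1"
    shift
    for species in "$@"; do
        printf 'augustus --species=%s --genemodel=partial --strand=both --gff3=on --protein=on --codingseq=on --uniqueGeneId=true %s > predicted_genes_%s.gff\n' \
            "$species" "$input" "$species"
    done
}

# Usage (species names are examples only):
# augustus_commands input_contigs.fasta aspergillus_nidulans neurospora_crassa
```

Once the printed commands look right, they can be pasted into the terminal or piped to sh. Each model writes to its own output file, so the runs do not overwrite each other.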
Another important consideration is that eukaryotic gene prediction is limited for short contigs (e.g. < 5 Kbp). In my experience, many shotgun metagenomic contigs fall below this length, and although some genes will still be predicted on them, longer contigs give better results.
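If you want to restrict a run to longer contigs, they can be filtered by length beforehand. This is a minimal sketch using awk. `filter_fasta_by_length` is a hypothetical helper name, and the 5000 bp threshold in the usage comment simply mirrors the rough figure above rather than any Augustus requirement.

```shell
# filter_fasta_by_length is a hypothetical helper: it prints only the records
# whose sequence is at least MIN_LEN bases long. Multi-line records are handled,
# and each kept sequence is written on a single line.
filter_fasta_by_length() {
    min_len="$1"
    fasta="$2"
    awk -v min="$min_len" '
        function emit() {
            if (header != "" && length(seq) >= min + 0) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}

# Usage:
# filter_fasta_by_length 5000 input_contigs.fasta > input_contigs_min5kb.fasta
```

Headers are passed through unchanged, so the filtered file can be used in place of the original input.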
Although Augustus is lightweight for eukaryotic gene prediction, it is much heavier than prokaryotic ORF prediction tools like Prodigal because eukaryotic genes are more complex. Don’t waste your time running Augustus on all your contigs if you don’t know whether they are eukaryotic. First, classify contigs as eukaryotic or prokaryotic with a lightweight tool like EukRep, which I have a guide for: https://scottc-bio.github.io/guides/EukRep0.6.7.html
Augustus can now be run with a single line.
augustus --species=aspergillus_nidulans --genemodel=partial --strand=both --gff3=on --protein=on --codingseq=on --uniqueGeneId=true --progress=true input_contigs.fasta > predicted_genes_out.gff
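This uses the following parameters:
--species=aspergillus_nidulans: the pre-trained species model to use.
--genemodel=partial: allows incomplete genes to be predicted at the ends of a sequence.
--strand=both: predicts genes on both strands.
--gff3=on: writes the output in GFF3 format.
--protein=on: includes the predicted protein sequences in the output.
--codingseq=on: includes the predicted coding sequences in the output.
--uniqueGeneId=true: builds gene identifiers from the sequence name.
--progress=true: shows a progress meter while running.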
Augustus has a lot more parameters that can be viewed as follows.
augustus --paramlist
Important note
Augustus has a character cutoff of 50 for sequence headers. If your contig headers are not unique within the first 50 characters, you will lose information on which contigs the genes were identified on. The uniqueGeneId parameter adds a suffix so that predicted genes never share exactly the same name, but you may still lose contig-level resolution.
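To find out whether this affects your data, you can count headers that collide once truncated and, if needed, rename the contigs to short unique IDs while keeping a lookup table. This is a sketch only. Both helper names are hypothetical, and it assumes the cutoff applies to the header text after the “>”.

```shell
# count_header_collisions: hypothetical helper that counts FASTA headers which
# are no longer unique when truncated to the first N characters (default 50).
count_header_collisions() {
    fasta="$1"
    n="${2:-50}"
    awk -v n="$n" '
        /^>/ { key = substr($0, 2, n); if (seen[key]++) dup++ }
        END { print dup + 0 }
    ' "$fasta"
}

# rename_contigs: hypothetical helper that replaces each header with a short
# unique ID (contig_1, contig_2, ...) and writes "new<TAB>old" to a mapping file.
rename_contigs() {
    fasta="$1"
    map="$2"
    awk -v map="$map" '
        /^>/ { i++; print "contig_" i "\t" substr($0, 2) > map; print ">contig_" i; next }
        { print }
    ' "$fasta"
}

# Usage:
# count_header_collisions input_contigs.fasta
# rename_contigs input_contigs.fasta contig_names.tsv > input_contigs_renamed.fasta
```

A result of 0 from the first helper means no renaming is needed. If you do rename, the mapping file lets you trace predicted genes back to the original contig names.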
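The protein sequences will be embedded in the comments of each gene in the .gff file. To get a .fasta file of the protein sequences, you can use gffread, a separate tool available from bioconda, which translates the CDS features in the .gff file using the input contigs.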
gffread predicted_genes_out.gff -g input_contigs.fasta -y predicted_proteins.fasta
This will produce two files: