This guide covers the download, installation, and local use of Augustus 3.5.0 (Hoff and Stanke 2019) for the prediction of ORFs from eukaryotic sequences.

A Linux-based system with Miniconda is required for this guide, and substantial computing power is recommended. Such a system might be available at your institution, or can be rented from an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html

A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.

1 Installing Augustus

Create a conda environment and install augustus.

conda create -n augustus_env -c bioconda -c conda-forge augustus=3.5.0
conda activate augustus_env

Check augustus is installed in the conda path.

which augustus

This should return something like “/home/<user>/miniconda3/envs/augustus_env/bin/augustus”.

Make a directory somewhere appropriate to store the files used by the augustus tool.

mkdir -p $HOME/tools/augustus/config

Copy the augustus configuration files to this new directory.

cp -r $CONDA_PREFIX/config/* $HOME/tools/augustus/config/

Next, make the path to the Augustus configuration files available in every shell session. To do this, open your .bashrc.

nano ~/.bashrc

Scroll down to any existing export lines, or to the end of the file, and add this line.

export AUGUSTUS_CONFIG_PATH=$HOME/tools/augustus/config

Use Ctrl+X, then ‘y’, then Enter to save and exit.
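If you prefer not to edit the file by hand, the same line can be appended from the command line. The sketch below is one way to do it, assuming a bash-like shell; append_once is a hypothetical helper name of my own, and it only adds the line if it is not already present, so it is safe to run more than once.

```shell
# Append a line to a file only if that exact line is not already there.
# append_once is a hypothetical helper name, not part of Augustus or conda.
append_once() {
  local line="$1"
  local file="$2"
  touch "$file"
  # -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (single quotes keep $HOME unexpanded, so it is resolved at login):
# append_once 'export AUGUSTUS_CONFIG_PATH=$HOME/tools/augustus/config' ~/.bashrc
```

Uncomment the example line to apply it to your own .bashrc.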

Reload the .bashrc so the change takes effect in the current shell.

source ~/.bashrc
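Before moving on, it can be worth confirming that the variable is set and that the copied directory actually contains species models. This is a minimal sketch; check_augustus_config is a hypothetical helper name, and it assumes the usual layout in which each species model is a subdirectory of species/ inside the config directory.

```shell
# Report whether a directory looks like a usable Augustus config directory.
# check_augustus_config is a hypothetical helper name.
check_augustus_config() {
  local cfg="${1:-$AUGUSTUS_CONFIG_PATH}"
  if [ -z "$cfg" ]; then
    echo "AUGUSTUS_CONFIG_PATH is not set"
    return 1
  fi
  if [ ! -d "$cfg/species" ]; then
    echo "no species directory in $cfg"
    return 1
  fi
  # Each species model is a subdirectory of species/
  local n
  n=$(find "$cfg/species" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
  echo "$n species models in $cfg"
}

# Example: check_augustus_config
```

Called with no argument it checks $AUGUSTUS_CONFIG_PATH; a message about a missing species directory usually means the copy step did not run as expected.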

2 Important considerations before starting

Augustus requires you to select a species model, which it uses as the basis for finding genes on the contigs.

You can bring up all the species options with.

augustus --species=help

There are many species to choose from. If you have a general mixed-community metagenome, you will have to pick a model to represent your community (or run Augustus multiple times with different species models). For example, if you are interested in the fungal community of a soil sample, then Aspergillus nidulans might be a suitable proxy. The advantage of Augustus is that it is lightweight, because it ships with these pre-trained models. For the same reason, its gene predictions will not be completely accurate, and accuracy will vary with the species model you pick. This is important to consider if you need more specific predictions, e.g. when predicting genes from the contigs of a single genome.
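If you do want to try several species models on the same contigs, a small loop saves retyping the command. The sketch below is only an illustration: run_species_models is a hypothetical helper, AUGUSTUS_BIN is an override variable I have assumed for convenience (it defaults to the augustus on your PATH), and the species names in the example should be confirmed against the output of --species=help.

```shell
# Run Augustus once per species model, writing one GFF3 file per model.
# run_species_models is a hypothetical helper; AUGUSTUS_BIN is an assumed
# override variable that defaults to the augustus found on your PATH.
run_species_models() {
  local fasta="$1"
  shift
  local sp
  for sp in "$@"; do
    "${AUGUSTUS_BIN:-augustus}" --species="$sp" --genemodel=partial --gff3=on \
      "$fasta" > "predicted_genes_${sp}.gff"
  done
}

# Example (species names are illustrative; confirm them with --species=help):
# run_species_models input_contigs.fasta aspergillus_nidulans aspergillus_fumigatus
```

Each run writes its own file, so the predictions from different models can be compared afterwards.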

Another important consideration is that eukaryotic gene prediction is limited on short contigs (e.g. < 5 kbp). In my experience, many shotgun metagenomic contigs fall below this length, and although some genes will still be predicted, longer contigs give better results.
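If you decide to drop short contigs before prediction, a length filter can be written with awk alone. This is a rough sketch rather than a recommended cutoff: filter_min_length is a hypothetical helper name, the threshold is whatever you pass in, and the output sequences are written on a single line each.

```shell
# Keep only FASTA records whose sequence is at least min_len bases long.
# filter_min_length is a hypothetical helper name. Multi-line records are
# joined, so each kept sequence is printed on one line.
filter_min_length() {
  local min_len="$1"
  local fasta="$2"
  awk -v min="$min_len" '
    function emit() {
      if (header != "" && length(seq) >= min) print header "\n" seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }
         { seq = seq $0 }
    END  { emit() }
  ' "$fasta"
}

# Example: filter_min_length 5000 input_contigs.fasta > contigs_min5kb.fasta
```

The example output file name is arbitrary; use the filtered file in place of input_contigs.fasta in the later steps if you go this route.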

Although Augustus is lightweight for a eukaryotic gene predictor, it is much heavier than prokaryotic ORF prediction tools like Prodigal, because eukaryotic genes are more complex. Don’t waste your time running Augustus on all your contigs if you don’t know whether they are eukaryotic. First, classify contigs as eukaryotic or prokaryotic with a lightweight tool like EukRep, which I have a guide for: https://scottc-bio.github.io/guides/EukRep0.6.7.html

3 Running Augustus

Augustus can now be run with a single line.

augustus --species=aspergillus_nidulans --genemodel=partial --strand=both --gff3=on --protein=on --codingseq=on --uniqueGeneId=true --progress=true input_contigs.fasta > predicted_genes_out.gff

This uses the following parameters:

  • species - the trained gene model to use
  • genemodel - partial allows incomplete genes to be predicted, which is important for fragmented contigs
  • strand - predicts genes on both DNA strands
  • gff3 - outputs the results in gff3-compliant format
  • protein - outputs predicted amino acid sequence embedded in gff3 comments
  • codingseq - outputs predicted coding sequence in nucleotide space embedded in gff3 comments
  • uniqueGeneId - ensures that each gene and transcript has a unique ID, formed by adding a gene suffix to the contig name (e.g. seqname.g1)
  • progress - displays a progress meter while augustus runs

Augustus has many more parameters, which can be listed as follows.

augustus --paramlist
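Once a run has finished, a quick way to sanity-check the output is to count how many genes were predicted on each contig. The sketch below assumes the tab-separated GFF3 written by the command above, where the first column is the contig name and the third column is the feature type; genes_per_contig is a hypothetical helper name.

```shell
# Count "gene" features per contig in a GFF file.
# genes_per_contig is a hypothetical helper name. Comment lines (including
# the embedded protein sequences) have no third column equal to "gene",
# so they are skipped automatically.
genes_per_contig() {
  awk -F '\t' '
    $3 == "gene" { n[$1]++ }
    END { for (c in n) print c "\t" n[c] }
  ' "$1" | sort
}

# Example: genes_per_contig predicted_genes_out.gff
```

Contigs with no predicted genes do not appear in the output at all.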

Important note

Augustus has a character cutoff of 50 for sequence headers. If your contig headers are not unique within their first 50 characters, you will lose information on which contigs the genes were identified on. The uniqueGeneId parameter adds a suffix so that predicted genes never share exactly the same name, but you may still lose contig-level resolution.
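One way around the header cutoff is to check for clashes first and, if needed, rename the contigs to short unique IDs while keeping a lookup table of the original headers. The sketch below shows both steps; find_clashing_headers and rename_headers are hypothetical helper names, and the contig_000001 naming scheme is just an example.

```shell
# Print any 50-character header prefixes shared by more than one contig.
# No output means the headers are unique within the first 50 characters.
find_clashing_headers() {
  grep '^>' "$1" | cut -c2-51 | sort | uniq -d
}

# Rewrite headers as contig_000001, contig_000002, ... and save a
# tab-separated table of new ID and original header for later lookup.
rename_headers() {
  local fasta="$1"
  local out_fasta="$2"
  local out_map="$3"
  awk -v map="$out_map" '
    /^>/ {
      n++
      id = sprintf("contig_%06d", n)
      print (id "\t" substr($0, 2)) > map
      print ">" id
      next
    }
    { print }
  ' "$fasta" > "$out_fasta"
}

# Example:
# find_clashing_headers input_contigs.fasta
# rename_headers input_contigs.fasta renamed_contigs.fasta header_map.tsv
```

If you rename, run Augustus on the renamed file and use the mapping table to trace predicted genes back to the original contig headers.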

4 Extracting protein sequences

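The protein sequences will be embedded in the comments of each gene in the .gff file. To get them as a .fasta file, it is easier to use gffread, which translates the predicted coding sequences using the contig .fasta file. gffread is a separate tool that is not installed by the steps above; it is available from bioconda.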

gffread predicted_genes_out.gff -g input_contigs.fasta -y predicted_proteins.fasta

This will produce two files:

  • predicted_proteins.fasta - the fasta file containing protein sequences
  • input_contigs.fasta.fai - an index file which can be used in downstream processes
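As a rough sanity check, you can count the records in the protein file and compare that number with what you expect from the .gff file. This is only a sketch; count_fasta_records is a hypothetical helper name.

```shell
# Count the records in a FASTA file by counting header lines.
# count_fasta_records is a hypothetical helper name.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Example: count_fasta_records predicted_proteins.fasta
```

A count of zero usually means the contig names in the .gff file and the .fasta file do not match.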

References

Hoff, Katharina J., and Mario Stanke. 2019. “Predicting Genes in Single Genomes with AUGUSTUS.” Current Protocols in Bioinformatics 65 (1): e57. doi:10.1002/cpbi.57.