This guide covers cloud-based use of the DeepTMHMM tool (Hallgren et al. 2022), run from the terminal, for predicting transmembrane protein topology and signal peptides.

A Linux-based system with Miniconda is required for this guide, and some high computing power is suggested. Such a system might exist at your institution, or can be rented from an online cloud computing provider. Follow my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html

A basic understanding of Linux, such as creating and moving directories, is assumed for this guide.

DeepTMHMM is a freely available package hosted on BioLib, which is essentially a cloud-based computing service. This means your sequences are sent to its servers for processing and the results are returned to you. If your amino acid sequences come from sensitive sources, e.g. human data, you may wish to be careful using this tool.

There is a way to run DeepTMHMM completely locally using the ‘docker’ image. In my experience, however, access to Docker is often restricted on academic institution computing clusters. If you have set up your own VM and have complete access, I recommend reading the instructions on local running (https://dtu.biolib.com/DeepTMHMM), as this will probably speed the process up.

1 Installing biolib

Firstly, create a conda environment.

conda create -n deeptmhmm python=3.10 -y
conda activate deeptmhmm

Install biolib.

pip install biolib

Ensure this is in the correct location.

which biolib

This should return something like: user/miniconda3/envs/deeptmhmm/bin/biolib. If it instead points somewhere outside the conda environment, then another version is installed on your system and is being found first. Either way, we can be safe by pointing explicitly to the version in your correctly set up conda environment.
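The same check can be scripted. The following is a minimal sketch, not part of the original workflow: the function name `biolib_in_active_env` and its parameters are hypothetical, and it assumes the active environment is the one named by the CONDA_PREFIX variable (falling back to Python's own `sys.prefix`).

```python
import os
import shutil
import sys
from pathlib import Path


def biolib_in_active_env(executable="biolib", env_prefix=None, search_path=None):
    """Return (path, is_inside_env) for the first `executable` found on PATH.

    Hypothetical helper: `env_prefix` defaults to CONDA_PREFIX (an assumption
    about how the environment was activated) and `search_path` to the normal PATH.
    """
    prefix = Path(os.path.abspath(env_prefix or os.environ.get("CONDA_PREFIX", sys.prefix)))
    found = shutil.which(executable, path=search_path)
    if found is None:
        # Nothing called `executable` is on the search path at all
        return None, False
    found_path = Path(os.path.abspath(found))
    # True only if the executable sits somewhere underneath the environment directory
    return str(found_path), prefix in found_path.parents


print(biolib_in_active_env())
```

If the second value is False, the explicit `$CONDA_PREFIX/bin/biolib` path is the safer way to call the tool.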

2 Running DeepTMHMM from the terminal

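Then run the prediction in the cloud using your FASTA file. $CONDA_PREFIX is set by conda to the path of the active environment, so this calls the biolib installed there.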

$CONDA_PREFIX/bin/biolib run DTU/DeepTMHMM --fasta example.fasta
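If you would rather launch the run from a Python script, the command above can be assembled and passed to `subprocess`. This is only a sketch under the assumption that biolib lives at `bin/biolib` inside the active conda environment, as in the command above; the function names `deeptmhmm_command` and `run_deeptmhmm` are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def deeptmhmm_command(fasta_path, conda_prefix=None):
    """Build the argument list for the biolib call shown above."""
    prefix = conda_prefix or os.environ.get("CONDA_PREFIX")
    if not prefix:
        raise RuntimeError("No conda environment is active (CONDA_PREFIX is unset)")
    biolib = Path(prefix) / "bin" / "biolib"
    return [str(biolib), "run", "DTU/DeepTMHMM", "--fasta", str(fasta_path)]


def run_deeptmhmm(fasta_path, conda_prefix=None):
    """Run the command; this uploads the sequences, so call it deliberately."""
    # check=True raises CalledProcessError if biolib exits with an error
    subprocess.run(deeptmhmm_command(fasta_path, conda_prefix), check=True)


# Example call (not executed here because it contacts the remote server):
# run_deeptmhmm("example.fasta")
```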

For me, a test run of 100 proteins took anywhere from 2 to 5 minutes. However, you are relying on a remote server, so this could vary greatly, and because the sequences have to be uploaded, a slow internet connection will increase the time.
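Since run time depends on how much you upload, it can help to check the size of a FASTA file first. Below is a minimal sketch that counts sequences and residues; the function name `summarise_fasta` and the example sequences are made up for illustration.

```python
def summarise_fasta(lines):
    """Count sequences and total residues in an iterable of FASTA lines."""
    n_seqs = 0
    n_residues = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_seqs += 1  # each header line starts a new sequence
        else:
            n_residues += len(line)  # sequences may wrap over several lines
    return n_seqs, n_residues


# Invented toy input; for a real file use: with open("example.fasta") as fh: summarise_fasta(fh)
example = [">protA", "MKTLLLTLVV", "VTIVCLDLGY", ">protB", "MSDNE"]
print(summarise_fasta(example))  # (2, 25)
```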

3 Parsing the output into something meaningful

The output is not easy to read, so I have a Python script which parses the ‘predicted_topologies.3line’ file (written to the biolib_results directory) so that each sequence becomes a single row with the following columns:

  • Sequence_ID - from the fasta header
  • Length - sequence length (amino acids)
  • Num_TM_helices - the number of times a string of Ms appears in the topology, each indicating a TM region
  • TM_helices(start-end) - the start and end positions (1-based) of each string of Ms
  • Prediction - the DeepTMHMM prediction of whether the protein has a SP (signal peptide, i.e. secreted), is TM (transmembrane), or is GLOB (globular)
  • Topology - for each amino acid residue there is a letter code signifying where that residue is. S = signal peptide region, I = inside (cytoplasmic side), O = outside (extracellular or lumen side), M = membrane (inside a TM helix), B = beta-barrel TM segment, X = unknown/ambiguous
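To make the topology codes above concrete, here is a small sketch that breaks a topology string into runs of identical letters with 1-based start and end positions. The function name `label_runs` and the toy topology are invented for illustration; the parsing script in this guide uses a regular expression for the M runs only.

```python
from itertools import groupby


def label_runs(topology):
    """Split a topology string into (label, start, end) runs, 1-based inclusive."""
    runs = []
    position = 1
    for label, group in groupby(topology):
        length = len(list(group))
        runs.append((label, position, position + length - 1))
        position += length
    return runs


# Invented 20-residue topology: 5 inside, 10 membrane, 5 outside
toy_topology = "I" * 5 + "M" * 10 + "O" * 5
for label, start, end in label_runs(toy_topology):
    print(label, start, end)

# Runs labelled 'M' correspond to the entries in the TM_helices(start-end) column
helices = [(s, e) for label, s, e in label_runs(toy_topology) if label == "M"]
print(helices)  # [(6, 15)]
```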

Open a new Python file.

nano deeptmhmm_parse.py

Copy the following code into the file.

import csv
import re
import sys

input_file = sys.argv[1]  # e.g., 'biolib_results/predicted_topologies.3line'
output_file = sys.argv[2]  # e.g., 'deeptmhmm_summary.txt'

def find_tm_helices(topology):
    """Find TM helices as stretches of 'M'"""
    helices = []
    for match in re.finditer(r"M+", topology):
        start = match.start() + 1
        end = match.end()
        helices.append((start, end))
    return helices

with open(input_file) as f, open(output_file, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([
        "Sequence_ID",
        "Length",
        "Num_TM_helices",
        "TM_helices(start-end)",
        "Prediction",
        "Topology"
    ])

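    # Drop blank lines so trailing newlines cannot break the 3-line grouping
    lines = [line for line in f if line.strip()]
    # Step through complete records only (header, sequence, topology)
    for i in range(0, len(lines) - 2, 3):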
        # The first line contains ID and prediction label separated by ' | '
        header = lines[i].strip()
        if " | " in header:
            # Split on the last ' | ' only, so IDs containing ' | ' do not raise an error
            seq_id, prediction = header[1:].rsplit(" | ", 1)
        else:
            seq_id = header[1:]
            prediction = ""
        seq = lines[i+1].strip()
        topology = lines[i+2].strip()

        tm_coords = find_tm_helices(topology)
        tm_count = len(tm_coords)
        tm_str = ";".join([f"{s}-{e}" for s, e in tm_coords])

        writer.writerow([
            seq_id,
            len(seq),
            tm_count,
            tm_str,
            prediction,
            topology
        ])

print(f"Saved summary to {output_file}")

Press Ctrl+O and hit enter to save. Then press Ctrl+X to exit.

Optionally, make the file executable. This is not strictly needed when the script is run with python as below.

chmod +x deeptmhmm_parse.py

Then run the parsing script.

python deeptmhmm_parse.py biolib_results/predicted_topologies.3line deeptmhmm_summary.txt

You now have the results of your DeepTMHMM prediction of transmembrane regions, saved as a comma-separated table in deeptmhmm_summary.txt.
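As a starting point for working with the summary, the sketch below reads the comma-separated table and lists the sequences with at least a given number of TM helices. The function name `proteins_with_min_helices` and the two demo rows are invented; the column names are those written by the parsing script.

```python
import csv
import io


def proteins_with_min_helices(handle, min_helices=1):
    """Return Sequence_IDs whose Num_TM_helices is at least `min_helices`."""
    reader = csv.DictReader(handle)
    return [
        row["Sequence_ID"]
        for row in reader
        if int(row["Num_TM_helices"]) >= min_helices
    ]


# Invented demo table; for real data use: open("deeptmhmm_summary.txt", newline="")
demo = io.StringIO(
    "Sequence_ID,Length,Num_TM_helices,TM_helices(start-end),Prediction,Topology\n"
    "protA,20,1,6-15,TM,IIIIIMMMMMMMMMMOOOOO\n"
    "protB,5,0,,GLOB,IIIII\n"
)
print(proteins_with_min_helices(demo))  # ['protA']
```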

References

Hallgren, Jeppe, Konstantinos D. Tsirigos, Mads Damgaard Pedersen, José Juan Almagro Armenteros, Paolo Marcatili, Henrik Nielsen, Anders Krogh, and Ole Winther. 2022. “DeepTMHMM Predicts Alpha and Beta Transmembrane Proteins Using Deep Neural Networks.” bioRxiv. doi:10.1101/2022.04.08.487609.