This guide covers cloud-based use of DeepTMHMM (Hallgren et al. 2022), a tool for predicting transmembrane protein topology and signal peptides, run entirely from the terminal.
You will need a Linux-based system with Miniconda installed, and reasonably powerful hardware is recommended; your institution may provide such a system, or one can be rented from an online cloud computing provider. See my guide on cloud-based VM setup here: https://scottc-bio.github.io/guides/Virtual-machines-for-bioinformatics.html
A basic understanding of Linux (creating and moving between directories, etc.) is assumed for this guide.
DeepTMHMM is a freely available package hosted on Biolib, essentially a cloud computing platform: your sequences are uploaded to Biolib's servers for processing and the results are returned to you. If your amino acid sequences come from sensitive subjects (e.g. human data), you may therefore want to be cautious about using this tool.
There is a way to run DeepTMHMM completely locally using its Docker image, but in my experience access to Docker is often restricted on academic institution computing clusters. If you have set up your own VM and have full access, I recommend reading the instructions on local running (https://dtu.biolib.com/DeepTMHMM), as this will probably speed the process up.
Firstly, create a conda environment.
conda create -n deeptmhmm python=3.10 -y
conda activate deeptmhmm
Install biolib.
pip install biolib
Ensure this is in the correct location.
which biolib
This should return something like: user/miniconda3/envs/deeptmhmm/bin/biolib. If it instead points somewhere outside the conda environment, another copy of biolib is installed on your system. Either way, you can be safe by explicitly calling the copy in your correctly set up conda environment.
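If you do not already have a FASTA file to hand, a minimal test file is easy to make. The snippet below writes a small example.fasta (matching the filename used in the run command below); the two sequences are made-up fragments for format illustration only, not real proteins.

```python
# Write a tiny test FASTA file (hypothetical sequences, for illustration only)
records = {
    "test_protein_1": "MKTLLVAGAVLLSACSSNAKIDQ",
    "test_protein_2": "MSLLILVLCFLPLAALGKVFGRC",
}

with open("example.fasta", "w") as fh:
    for seq_id, seq in records.items():
        fh.write(f">{seq_id}\n{seq}\n")

print(open("example.fasta").read())
```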
Then just run the prediction in the cloud on your FASTA file.
$CONDA_PREFIX/bin/biolib run DTU/DeepTMHMM --fasta example.fasta
For me, a test run of 100 proteins took anywhere from 2 to 5 minutes. But you are relying on a remote server, so this could vary greatly, and since the sequences have to be uploaded, a slow internet connection will increase the time.
The output is not easily comprehensible, so I have a Python script which parses the results from the outputted 'predicted_topologies.3line' file so that each sequence becomes a single row with the following columns: Sequence_ID, Length, Num_TM_helices, TM_helices(start-end), Prediction and Topology.
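For reference, each record in the .3line file spans three lines: a header with the sequence ID and the predicted class separated by ' | ', the amino acid sequence, and a per-residue topology string (labels such as M for membrane helix, I for inside and O for outside). The record below is made up purely to illustrate this three-line structure, which is what the parsing script expects.

```python
# A hypothetical .3line record, showing the structure the parser relies on
record = (
    ">test_protein_1 | TM\n"
    "MKTLLVAGAVLLSA\n"
    "IIIIMMMMMMOOOO\n"
)

header, seq, topology = record.strip().split("\n")
seq_id, prediction = header[1:].split(" | ")

# The topology string has exactly one label per residue
assert len(seq) == len(topology)
print(seq_id, prediction)
```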
Open a new Python file with nano.
nano deeptmhmm_parse.py
Copy the following code into the file.
import csv
import re
import sys

input_file = sys.argv[1]   # e.g. 'biolib_results/predicted_topologies.3line'
output_file = sys.argv[2]  # e.g. 'deeptmhmm_summary.txt'

def find_tm_helices(topology):
    """Find TM helices as stretches of 'M' (1-based, inclusive coordinates)."""
    helices = []
    for match in re.finditer(r"M+", topology):
        start = match.start() + 1
        end = match.end()
        helices.append((start, end))
    return helices

with open(input_file) as f, open(output_file, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([
        "Sequence_ID",
        "Length",
        "Num_TM_helices",
        "TM_helices(start-end)",
        "Prediction",
        "Topology",
    ])
    lines = f.readlines()
    # Each record spans three lines: header, sequence, topology
    for i in range(0, len(lines), 3):
        # The header line contains ID and prediction label separated by ' | '
        header = lines[i].strip()
        if " | " in header:
            seq_id, prediction = header[1:].split(" | ")
        else:
            seq_id = header[1:]
            prediction = ""
        seq = lines[i + 1].strip()
        topology = lines[i + 2].strip()
        tm_coords = find_tm_helices(topology)
        tm_count = len(tm_coords)
        tm_str = ";".join(f"{s}-{e}" for s, e in tm_coords)
        writer.writerow([
            seq_id,
            len(seq),
            tm_count,
            tm_str,
            prediction,
            topology,
        ])

print(f"Saved summary to {output_file}")
Press Ctrl+O and hit enter to save. Then press Ctrl+X to exit.
Optionally, make the file executable (not strictly needed since we invoke it with python).
chmod +x deeptmhmm_parse.py
Then run the parsing script.
python deeptmhmm_parse.py biolib_results/predicted_topologies.3line deeptmhmm_summary.txt
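If you want to sanity-check the coordinate logic without waiting for a server run, the snippet below applies the same 'M'-run regex to a hypothetical topology string. The string is made up: it places two membrane helices at residues 5-10 and 15-18.

```python
import re

# Hypothetical per-residue topology: two TM helices at residues 5-10 and 15-18
topology = "IIIIMMMMMMOOOOMMMMII"

# Same approach as the script: each run of 'M' is a helix, 1-based inclusive
helices = [(m.start() + 1, m.end()) for m in re.finditer(r"M+", topology)]
print(helices)  # -> [(5, 10), (15, 18)]
```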
You now have a one-row-per-sequence summary of your DeepTMHMM predictions of transmembrane regions.