bioinf-mcb / metagenomic-deepfri

Pipeline for searching and aligning contact maps for proteins, then running DeepFri's GCN.

Home Page: https://metagenomic-deepfri.readthedocs.io/

License: GNU General Public License v3.0

Python 91.38% Cython 8.62%

genomics protein-structure mmseqs protein-sequences protein-function-prediction

metagenomic-deepfri's Introduction

🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. Amino acid sequence and structural features of proteins determine a wide range of functions: from binding specificity and conferring mechanical stability, to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advancements in low-cost sequencing technologies and computational methods (e.g. gene prediction), has resulted in a pressing need for efficient software to facilitate the annotation of protein databases. Metagenomic-DeepFRI addresses such needs, building upon efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and improves runtimes of DeepFRI by 2-12 times!

📋 Pipeline stages

1. Search for proteins similar to the query in PDB and the supplied FoldComp databases with MMseqs2.
2. Find the best alignment among the MMseqs2 hits using PyOpal.
3. Align the target protein's contact map to the query protein of unknown structure.
4. Run DeepFRI with the structure if one is found in the database; otherwise run DeepFRI with the sequence only.
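
The third stage can be pictured as re-indexing the target's contacts through the pairwise alignment. The sketch below is only an illustration under assumptions, not the pipeline's implementation: `transfer_contacts` is a hypothetical helper, both sequences are assumed to be rows of one pairwise alignment with `-` as the gap character, and contacts are assumed to be 0-based residue index pairs on the ungapped target.

```python
def transfer_contacts(gapped_query, gapped_target, target_contacts):
    """Map target residue contacts onto query residue indices.

    Hypothetical helper for illustration: both inputs are rows of the same
    pairwise alignment ('-' marks a gap); contacts are 0-based index pairs
    on the ungapped target sequence.
    """
    if len(gapped_query) != len(gapped_target):
        raise ValueError("aligned sequences must have equal length")

    target_to_query = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(gapped_query, gapped_target):
        if q_char != "-" and t_char != "-":
            # Aligned column: the two residues correspond to each other.
            target_to_query[t_idx] = q_idx
        if q_char != "-":
            q_idx += 1
        if t_char != "-":
            t_idx += 1

    # Contacts that touch a target residue without a query counterpart
    # (i.e. aligned to a gap) are dropped instead of raising KeyError.
    return sorted(
        (target_to_query[a], target_to_query[b])
        for a, b in target_contacts
        if a in target_to_query and b in target_to_query
    )
```

For example, with query row `AC-DE` and target row `ACFD-`, the target contact `(1, 2)` is dropped because target residue 2 faces a gap in the query.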

🛠️ Built With

🔧 Installation

Install from PyPI. Installation might take a few minutes due to download of MMseqs2 binaries.

pip install mdeepfri

Run and view the help message.

mDeepFRI --help

💡 Usage

1. Prepare structural database

1.1 Existing `FoldComp` databases

The PDB database is downloaded and installed automatically during the first run of mDeepFRI. PDB entries suffer from formatting inconsistencies, so around 10% of PDB alignments fail; each failure is reported as a WARNING. We suggest coupling the PDB search with databases of predicted structures, as this massively improves structural coverage of the protein universe. A good protein structure allows DeepFRI to annotate function in more detail. However, the sequence branch of the model carries the largest weight, so even an erroneous predicted structure has only a minor effect on the prediction. Details can be found in the original manuscript, fig. 2A.

You can download additional databases from the website. During the first run, FASTA sequences are extracted from the FoldComp database, and an MMseqs2 database is created and indexed. You can use different databases, but be mindful that computation time might increase exponentially with the size of the database.

Tested databases:

afdb_swissprot
afdb_swissprot_v4
afdb_rep_v4
afdb_rep_dark_v4
afdb_uniprot_v4
esmatlas
esmatlas_v2023_02
highquality_clust30

ATTENTION: Please do not rename downloaded databases. FoldComp is inconsistent in the way FASTA sequences are extracted (example), so the pipeline was tweaked for each database. If the database you need does not work, please report it in the issues and we will add it as soon as possible. Sorry for the inconvenience.

ATTENTION: Database creation is a very sensitive step that relies on external software. If the pipeline is interrupted during this step, the databases might be corrupted. If you are not sure about your database, rerun the pipeline with the --overwrite flag; it will rerun the database creation process.

1.2. Custom `FoldComp` database

To use a personal database of structures, you have to create a custom FoldComp database. To do so, download a FoldComp executable and run the following command:

foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]

2. Download models

Two versions of the models are available:

v1.0 - the original version from the DeepFRI publication.
v1.1 - a version fine-tuned on AlphaFold models and machine-generated Gene Ontology UniProt annotations. You can read details about v1.1 in the ISMB 2023 presentation by Pawel Szczerbiak.

To download models run command:

mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes output to stderr, so use 2> to redirect it to a file. Other available parameters are listed by mDeepFRI --help.
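
If you launch the tool from a Python script rather than a shell, the same redirection can be done with `subprocess`. This is a minimal sketch: the function names `build_predict_command` and `run_with_log` are hypothetical, and only the flags shown in this README (`-i`, `-d`, `-w`, `-o`, `-p`) are used.

```python
import subprocess


def build_predict_command(sequences, databases, weights, output, modes=()):
    """Assemble the predict-function call from flags shown in this README."""
    cmd = ["mDeepFRI", "predict-function", "-i", str(sequences)]
    for db in databases:
        cmd += ["-d", str(db)]  # -d may be repeated, one per database
    cmd += ["-w", str(weights), "-o", str(output)]
    for mode in modes:
        cmd += ["-p", mode]  # -p may be repeated, one per prediction mode
    return cmd


def run_with_log(cmd, log_path="log.txt"):
    """Run the command and send stderr (the log stream) to a file, like `2> log.txt`."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stderr=log, check=False).returncode
```

Calling `run_with_log(build_predict_command(...))` returns the exit code of the process, so a non-zero value can be checked together with the contents of the log file.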

✅ Results

The output folder will contain:

{database_name}.search_results.tsv
query.mmseqsDB + index from MMSeqs2 search.
results.tsv - a final output from the DeepFRI model.

Example output (`results.tsv`)

Protein	GO_term/EC_numer	Score	Annotation	Neural_net	DeepFRI_mode	DB_hit	DB_name	Identity
MIP_00215364	GO:0016798	0.218	hydrolase activity, acting on glycosyl bonds	gcn	mf	MIP_00215364	mip_rosetta_hq	0.933
1GVH_1	GO:0009055	0.217	electron transfer activity	gnn	mf	AF-P24232-F1-model_v4	afdb_swissprot_v4	1.0
unaligned	3.2.1.-	0.215	3.2.1.-	cnn	ec	nan	nan	nan

This is an example of protein annotation with the AlphaFold database.

Protein - the name of the protein from the FASTA file.
GO_term/EC_numer - predicted GO term or EC number (dependent on mode)
Score - DeepFRI score, translates to model confidence in prediction. Details in publication.
Annotation - annotation from ontology
Neural_net - type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). GCN is used when structural information is available in the database, which generally allows more confident predictions. When no protein is above the similarity cut-off (50% identity by default), CNN is used.

DeepFRI_mode:

mf = molecular_function
bp = biological_process
cc = cellular_component
ec = enzyme_commission

DB_hit - name of the hit in the database. Empty (nan in the example above) if no hit was found.
DB_name - name of the database. Empty (nan in the example above) if no hit was found.
Identity - sequence identity between query and hit. Empty (nan in the example above) if no hit was found.
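
Because `results.tsv` is a plain tab-separated file, it can be filtered with the standard library. The sketch below reuses two rows from the example above; `load_confident_hits` and the `min_score` threshold are illustrative assumptions, not part of mDeepFRI.

```python
import csv
import io


def load_confident_hits(handle, min_score=0.5, mode=None):
    """Return rows with Score >= min_score, optionally limited to one DeepFRI_mode."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["Score"]) < min_score:
            continue
        if mode is not None and row["DeepFRI_mode"] != mode:
            continue
        rows.append(row)
    return rows


# Two rows copied from the example output above.
EXAMPLE = (
    "Protein\tGO_term/EC_numer\tScore\tAnnotation\tNeural_net\t"
    "DeepFRI_mode\tDB_hit\tDB_name\tIdentity\n"
    "MIP_00215364\tGO:0016798\t0.218\thydrolase activity, acting on glycosyl bonds\t"
    "gcn\tmf\tMIP_00215364\tmip_rosetta_hq\t0.933\n"
    "unaligned\t3.2.1.-\t0.215\t3.2.1.-\tcnn\tec\tnan\tnan\tnan\n"
)

hits = load_confident_hits(io.StringIO(EXAMPLE), min_score=0.2, mode="mf")
```

For a real run, pass an open file handle for `results.tsv` instead of the in-memory example.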

⚙️Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:

Molecular Function (MF)
Biological Process (BP)
Cellular Component (CC)
Additionally, Metagenomic-DeepFRI v1.0 is able to predict the Enzyme Commission (EC) number. By default, the tool makes predictions in all 4 categories. To select only some of them, pass the -p or --processing-modes parameter several times, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp

2. Hierarchical database search

Different databases provide different levels of evidence. For example, PDB structures are experimentally determined, so they are considered the highest-quality data; new proteins are therefore queried against PDB first. Computational predictions differ in quality, e.g. AlphaFold predictions are often more accurate than ESMFold predictions. We provide an opportunity to search multiple databases in a hierarchical manner. For example, to search the AlphaFold database first and then ESMFold, pass the -d or --databases parameter several times, e.g.:

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
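
The idea behind the hierarchy is that a query keeps the hit from the first database in which it matches and is not searched again in later ones. The sketch below illustrates only that idea; `search_in_order` is a hypothetical name, and each database is modelled as a simple dictionary of precomputed hits rather than a real MMseqs2 search.

```python
def search_in_order(query_ids, databases):
    """Assign each query to the first database (in the given order) that has a hit.

    `databases` is an ordered list of (name, hits) pairs, where `hits` maps a
    query id to the id of its best hit. This stands in for a real search.
    """
    assigned = {}
    remaining = list(query_ids)
    for db_name, hits in databases:
        still_unmatched = []
        for query_id in remaining:
            if query_id in hits:
                assigned[query_id] = (db_name, hits[query_id])
            else:
                still_unmatched.append(query_id)
        # Only queries without a hit so far move on to the next database.
        remaining = still_unmatched
    return assigned, remaining
```

Queries left in `remaining` after the last database have no structure and would be annotated from sequence only.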

3. Temporary files

The first run of mDeepFRI with a database creates temporary files needed by the pipeline. If you don't want to keep them for the next run, add the --remove-intermediate flag.

4. CPU / GPU utilization

If the --threads argument is provided, the app parallelizes certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction; otherwise the model runs on CPUs. Technical tip: a single instance of DeepFRI on GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:

Gligorijević et al. "Structure-based protein function prediction using graph convolutional networks" Nat. Comms. (2021). https://doi.org/10.1038/s41467-021-23303-9
Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nat. Biotechnol. (2017) https://doi.org/10.1038/nbt.3988
Kim, Mirdita & Steinegger "Foldcomp: a library and format for compressing and indexing large protein structure sets" Bioinformatics (2023) https://doi.org/10.1093/bioinformatics/btad153
Maranga et al. "Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method" mSystems (2023) https://doi.org/10.1128/msystems.01178-22

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.

metagenomic-deepfri's People

valentynbez soliareofastora pupubear007 horikitasaku dmgolembiowski bio-ontology-research-group

metagenomic-deepfri's Issues

Implement top K hits alignment from MMseqs result

Based on test data from the E. coli genome GCA_000731455.1, a lot of sequences have multiple hits. MMseqs is not precise, so we cannot rely on the top hit only, but using the top k hits would reduce the dimensionality of the final database significantly.
Here is the distribution of hits to a single protein after using the defaults of e_value=10e-5 and identity=0.3.

Top k-filtering reduces the initial size of the database (263,780) substantially.

I propose using only top k=30 hits as a default.
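
A minimal sketch of the proposed filtering, assuming hits are available as `(query, target, bit_score)` tuples and that ranking by bit score is acceptable (both are assumptions; `top_k_hits` is a hypothetical name):

```python
from collections import defaultdict


def top_k_hits(hits, k=30):
    """Keep the k best-scoring targets for every query.

    `hits` is an iterable of (query, target, bit_score) tuples.
    """
    by_query = defaultdict(list)
    for query, target, score in hits:
        by_query[query].append((score, target))

    return {
        query: [
            target
            for _, target in sorted(scored, key=lambda item: item[0], reverse=True)[:k]
        ]
        for query, scored in by_query.items()
    }
```

Only the targets returned here would then go on to the slower pairwise alignment step.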

Remove `tensorflow` dependency

Port models to onnx
Validate quality
Validate performance

gcc: error: unrecognized command line option ‘-std=c++17’

Dear all,

Thanks very much for your excellent work.

When I run pip install ., I get an error like this:
`gcc: error: unrecognized command line option ‘-std=c++17’
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mDeepFRI
Failed to build mDeepFRI
ERROR: Could not build wheels for mDeepFRI, which is required to install pyproject.toml-based projects`

I know this means the gcc version is too old. It seems that updating gcc requires root access.
Do you have any suggestions about it?

Thanks very much for your help!

`ruff` linting

Docs: https://github.com/charliermarsh/ruff

ENH: replace structure lookup with EMBER3D

We might consider replacing the lookup in the AF2 DB with EMBER3D or, alternatively, looking up first and, if no structure is found, generating it on the fly with EMBER3D.

I would love to see what @VGligorijevic thinks

Implement `GradCam` within `ONNX`

After #42
@wwydmanski comment if worth bothering at all - do you want output from old models or you want to replace it with your methods?

fix installation instructions

make them more user friendly.

split for Linux / Mac users
include Conda in the instructions (ie. first, create a conda environment, then run pip install, etc.)

restructure repo

make it more standard. for example, only setup.py should be in the main folder. other .py scripts should live elsewhere.

Improve `Predictor` to write stream to the disk

Currently the Predictor class accumulates predictions. This is inconvenient, as RAM usage grows with the amount of input (current requirement: ~40 GB of RAM per 10k protein sequences).

Proposed behaviour
Predictions are written straight into the file and not accumulated in RAM.
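
A rough sketch of the proposed behaviour, assuming predictions can be produced lazily as tuples. `write_predictions`, `fake_predictor`, the column names and the `GO:0000000` placeholder are all hypothetical and only stand in for the real `Predictor` output:

```python
import csv
import io


def write_predictions(predictions, handle, header=("Protein", "GO_term", "Score")):
    """Write rows one at a time so that memory use does not grow with input size."""
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(header)
    count = 0
    for row in predictions:  # consumes the iterator lazily, one row at a time
        writer.writerow(row)
        count += 1
    return count


def fake_predictor(proteins):
    """Stand-in generator: yields one placeholder prediction per protein."""
    for name in proteins:
        yield (name, "GO:0000000", 0.5)


# In-memory demonstration; a real run would pass an open file instead.
buffer = io.StringIO()
written = write_predictions(fake_predictor(["p1", "p2"]), buffer)
```

The key point is that the predictor is a generator, so only one row is held in memory at any time.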

Replace alignment from `biopython`

Python code is slow for the number of alignments we might have; it's better to replace it with something speedier, e.g. parasail (Python bindings).
Here is the time for E. Coli genome annotation with AF2DB alignments with 32 cores:

[2023-04-14 15:10:29] metagenomic_deepfri.metagenomic_deepfri INFO: Starting metagenomic-DeepFRI.
....
[2023-04-14 15:11:00] search_alignments.search_alignments INFO: Total alignments to check 136531
[2023-04-14 16:47:37] metagenomic_deepfri.metagenomic_deepfri INFO: Processing mode: mf
...
[2023-04-14 17:23:41] metagenomic_deepfri.metagenomic_deepfri INFO: meta-DeepFRI finished successfully

In this procedure alignment took more time than DeepFRI annotation on 10GB VRAM (40 mins). The number of alignments will increase with the size of the database.

Error with Foldcomp databases (AttributeError: 'pyopal.lib.Database' object has no attribute 'search')

log.txt

"pos_weight" in train_DeepFRI.model

In my opinion, pos_weight is meant to solve the problem of sample imbalance, but it doesn't seem to work. When I put it in model.train(), I get errors. Do you have any suggestions?

Unable to index afdb_uniprot database

With mDF v1.1.5, I was unable to use the UniProt database, as it was indexed into 8 files:
-rw-r--r-- 1 root root 1.1T Apr 3 18:17 afdb_uniprot_v4
-rw-r--r-- 1 root root 4 Apr 5 09:18 afdb_uniprot_v4.dbtype
-rw-r--r-- 1 root root 40G Apr 5 12:31 afdb_uniprot_v4.fasta.gz
-rw-r--r-- 1 root root 5.6G Apr 5 09:21 afdb_uniprot_v4.index
-rw-r--r-- 1 root root 8.5G Apr 5 09:24 afdb_uniprot_v4.lookup
-rw-r--r-- 1 root root 65G Apr 5 12:51 afdb_uniprot_v4.mmseqsDB
-rw-r--r-- 1 root root 4 Apr 5 12:51 afdb_uniprot_v4.mmseqsDB.dbtype
-rw-r--r-- 1 root root 6.8G Apr 5 12:48 afdb_uniprot_v4.mmseqsDB_h
-rw-r--r-- 1 root root 4 Apr 5 12:48 afdb_uniprot_v4.mmseqsDB_h.dbtype
-rw-r--r-- 1 root root 4.7G Apr 5 12:56 afdb_uniprot_v4.mmseqsDB_h.index
-rw-r--r-- 1 root root 371M Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.0
-rw-r--r-- 1 root root 83G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.1
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.2
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.3
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.4
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.5
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.6
-rw-r--r-- 1 root root 82G Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.7
-rw-r--r-- 1 root root 4 Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.dbtype
-rw-r--r-- 1 root root 1.3K Apr 5 14:05 afdb_uniprot_v4.mmseqsDB.idx.index
-rw-r--r-- 1 root root 5.1G Apr 5 12:54 afdb_uniprot_v4.mmseqsDB.index
-rw-r--r-- 1 root root 8.8G Apr 5 12:58 afdb_uniprot_v4.mmseqsDB.lookup
-rw-r--r-- 1 root root 27 Apr 5 12:47 afdb_uniprot_v4.mmseqsDB.source
-rw-r--r-- 1 root root 39M Apr 5 09:24 afdb_uniprot_v4.source

Manually merging the afdb_uniprot_v4.mmseqsDB.idx.0-7 files into afdb_uniprot_v4.mmseqsDB.idx did not solve the problem.

log_v115_jc-1475_afdb_uniprot.txt

"Segmentation fault (core dumped)" error

While running mDF v1.1.5 with a metagenomic dataset (50 MB input file) and the highquality_clust30 reference DB on 32 threads, I receive the above error. The log stops at ''mmseqs.filter_mmseqs_results INFO: 594529 pairs after filtering with k=5 best hits.'' and the process is terminated. Here is my command:

''mDeepFRI --debug predict-function --threads 32 -i /home/lukasz/LS_CF8_transl.faa -d /TomaszLab/metagenomes/foldcomp_database/hclust30/highquality_clust30 -w /TomaszLab/metagenomes/mDF_weights/ -o /TomaszLab/metagenomes/output/COFCO/v115_C8 2> log_C8_hclust.txt''
log_C8_hclust.txt

Replace `fasta_file_io` with `FastaFile` from `pysam`

Docs:
https://pysam.readthedocs.io/en/latest/api.html#fasta-files

Wrong headers in `MMseqs2` results

@lmszydlowski noticed that for some input .faa files the pipeline doesn't predict any GCNs. The problem arises when one uses the following headers:

>gnl|extdb|pgaptmp_000002

or more generally:

>gnl|extdb|pgaptmp_000002 MFS transporter [Microbacterium]

In such cases the alignments.json file is always empty. It's not hard to check that this is caused by inconsistent header naming in mmseqs2_search_results.m8; namely, for the above cases we are getting:

pgaptmp_000002

instead of:

gnl|extdb|pgaptmp_000002

As a result, the MMseqs2 search results are "not visible" and skipped. This can be a major issue because this kind of header formatting is produced very frequently by external bioinformatics tools.
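
One possible workaround is to build a lookup that accepts both forms of the identifier. The sketch below covers only the case reported here (the search results contain the last `|`-separated field) and does not try to reproduce MMseqs2's header parsing rules; `build_id_lookup` is a hypothetical helper.

```python
def build_id_lookup(fasta_headers):
    """Map both the full id and its last '|'-separated field to the full id."""
    lookup = {}
    for header in fasta_headers:
        tokens = header.lstrip(">").split()
        if not tokens:
            continue  # skip empty header lines
        full_id = tokens[0]
        lookup[full_id] = full_id
        short_id = full_id.rsplit("|", 1)[-1]
        # setdefault keeps the first owner if two headers share a short id
        lookup.setdefault(short_id, full_id)
    return lookup


lookup = build_id_lookup(
    [">gnl|extdb|pgaptmp_000002 MFS transporter [Microbacterium]"]
)
```

With this lookup, the `pgaptmp_000002` entry from the .m8 file resolves back to `gnl|extdb|pgaptmp_000002`.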

Redesign the CLI to fit the need of workflow manager

Workflow managers are usually based on the concept:
input -> command -> output
So we need to structure the CLI in the following way:
python_script.py --input input.fasta --output output_dir --database database_dir --threads 8

Deprecate:

project names and folders

Empty mmseqs results file

I receive a 'pyopal not found' error even though I have pyopal installed. I receive the same error regardless of whether I have pyopal 0.5.1 or 0.4.1 (as recommended).

Remove `torch` dependency to run on GPU

Running models on GPU is beneficial
ONNX is failing to create GPU models without torch import, see microsoft/onnxruntime#11092
torch is, therefore, a dependency, and it is a really bulky one.

There should be a way to import needed modules without torch.

CUDA 12.0
onnxruntime-gpu 1.15.1

Build fails

Build fails with the following error:

Building wheel for mDeepFRI (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for mDeepFRI (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [51 lines of output]
    /tmp/pip-build-env-svcl1p9q/overlay/lib/python3.10/site-packages/setuptools/dist.py:745: SetuptoolsDeprecationWarning: Invalid dash-separated options
    !!
    
            ********************************************************************************
            Usage of dash-separated 'description-file' will not be supported in future
            versions. Please use the underscore name 'description_file' instead.
    
            By 2023-Sep-26, you need to update your project and remove deprecated calls
            or your builds will no longer be supported.
    
            See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
            ********************************************************************************
    
    !!
      opt = self.warn_dash_deprecation(opt, section)
    running bdist_wheel
    running build
    running build_py
    running egg_info
    writing mDeepFRI.egg-info/PKG-INFO
    writing dependency_links to mDeepFRI.egg-info/dependency_links.txt
    writing entry points to mDeepFRI.egg-info/entry_points.txt
    writing top-level names to mDeepFRI.egg-info/top_level.txt
    reading manifest file 'mDeepFRI.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'mDeepFRI.egg-info/SOURCES.txt'
    running build_ext
    cmake /home/ubuntu/Metagenomic-DeepFRI -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/home/ubuntu/Metagenomic-DeepFRI/build/lib.linux-x86_64-cpython-310/mDeepFRI/CPP_lib -DPY_INCLUDE_PATH=/home/ubuntu/miniconda3/envs/deepfri/include/python3.10/ -DCMAKE_BUILD_TYPE=Release
    -- Configuring done (0.0s)
    -- Generating done (0.0s)
    -- Build files have been written to: /home/ubuntu/Metagenomic-DeepFRI/build/temp.linux-x86_64-cpython-310
    cmake --build . --config Release -- -j4
    [ 25%] Building CXX object CMakeFiles/AtomDistanceIO.dir/mDeepFRI/CPP_lib/load_contact_maps.cpp.o
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp: In function ‘std::pair<bool*, int> LoadDenseContactMap(const string&, float)’:
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp:58:24: error: invalid initialization of reference of type ‘std::unique_ptr<float []>&’ from expression of type ‘float*’
       58 |           if (Distance(atoms_positions, atom_a, atom_b) <=
          |                        ^~~~~~~~~~~~~~~
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp:11:49: note: in passing argument 1 of ‘float Distance(std::unique_ptr<float []>&, int, int)’
       11 | static float Distance(std::unique_ptr<float[]> &array, int i, int j) {
          |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp: In function ‘std::vector<std::pair<int, int> >* LoadSparseContactMap(const string&, float)’:
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp:106:24: error: invalid initialization of reference of type ‘std::unique_ptr<float []>&’ from expression of type ‘float*’
      106 |           if (Distance(atoms_positions, atom_a, atom_b) <=
          |                        ^~~~~~~~~~~~~~~
    /home/ubuntu/Metagenomic-DeepFRI/mDeepFRI/CPP_lib/load_contact_maps.cpp:11:49: note: in passing argument 1 of ‘float Distance(std::unique_ptr<float []>&, int, int)’
       11 | static float Distance(std::unique_ptr<float[]> &array, int i, int j) {
          |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
    make[2]: *** [CMakeFiles/AtomDistanceIO.dir/build.make:104: CMakeFiles/AtomDistanceIO.dir/mDeepFRI/CPP_lib/load_contact_maps.cpp.o] Error 1
    make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/AtomDistanceIO.dir/all] Error 2
    make: *** [Makefile:91: all] Error 2
    error: command '/home/ubuntu/miniconda3/envs/deepfri/bin/cmake' failed with exit code 2
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mDeepFRI
Failed to build mDeepFRI
ERROR: Could not build wheels for mDeepFRI, which is required to install pyproject.toml-based projects

Move parses to `Cython`

This will improve retrieval from foldcomp.

Update docker image repository name

In the README there is a Docker image linked to my personal repository. In the future you will be unable to update it.

docker run -it -u $(id -u):$(id -g) -v /YOUR_DATA_ROOT:/data soliareofastora/metagenomic-deepfri

unit tests

test reference database built
GCN / CNN split

`mmseqs` uses all CPU cores of the machine

During mmseqs search command all cores of the machine are used.

Alignment fails for huge proteins (extreme case)

for proteins over 30k aa alignment fails
currently investigating althonos/pyopal#3

Current behaviour

Report alignment failure in the log, use CNN only.

Hierarchical multiple database search

The quality of available structures is widely different
Thus, we should be able to align queries to multiple databases in a hierarchical manner

Proposed hierarchy:

PDB -> MIP -> AF2 -> ESMFold

Current issue

FoldComp is optimised for computationally predicted structures, we cannot ship it in FoldComp format.

`parse_mmcif` fails to give adequate outputs

mmcif format does not contain sequence data, which will cause the database build to fail. Two ways:

deprecate it and remove (I think pdb is much more common anyways)
add function to load corresponding FASTA data

Look for test data: https://github.com/valentynbez/Metagenomic-DeepFRI/tree/folder-structure-patch
What do you think @tkosciol?

Implement `pdbfixer`

Resolve problematic PDB entries using pdbfixer: https://github.com/openmm/pdbfixer

feature request: multiple template structures for GCN module

in the case where there are multiple non-overlapping or partially overlapping hits in the MMSeqs2 step, pick all templates and merge them.
For example, the query is a 300 residue protein (1-300). TemplateA hits residues 1-200 and TemplateB hits 150-300. We would like to use both, because they cover different parts of the query.
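
A minimal sketch of how the template selection could work, assuming hits arrive ordered best first with 1-based inclusive query coordinates. `select_templates` and `min_new_residues` are hypothetical, and merging the contact maps themselves is not shown:

```python
def select_templates(hits, min_new_residues=1):
    """Greedily keep templates that cover query residues not covered so far.

    `hits` is a list of (name, start, end) tuples in query coordinates
    (1-based, inclusive), ordered from best to worst hit.
    """
    covered = set()
    selected = []
    for name, start, end in hits:
        span = set(range(start, end + 1))
        newly_covered = span - covered
        if len(newly_covered) >= min_new_residues:
            selected.append((name, start, end))
            covered |= span
    return selected, len(covered)
```

For the example above, TemplateA (1-200) and TemplateB (150-300) are both kept and together cover all 300 residues, while a third hit spanning 10-50 would be skipped because it adds nothing new.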

@PawelSzczerbiak it is partly similar to what we were doing in the microprot pipeline, so maybe borrowing some of the subroutines from there would make sense?

update installation instructions

Current installation instructions are not up to date with repo contents.
For example:

libboost libraries are no longer required
post_setup.py no longer exists

Output vectors instead of GO-terms

Output vectors can be analyzed further using different dimensionality reduction techniques in larger datasets.
Instead of mapping to a single go term, the output can be the last layer of the NN.

PROPOSED IMPROVEMENT

Add --output-type option with choices go-terms and vectors.

Problem with running the pipeline

log2.txt
log3.txt

After installing mDF on 2 different VMs with afdb_swissprot v4 and 2 different queries (bacterial proteomes, previously annotated) I encounter the following errors (see attached logs). In both cases it tries to align the sequences against python3.

Move from `Boost.Python` to `Cython`

Robustness - all Python code can be profiled for bottlenecks, and the slowest parts of the pipeline can be rewritten in C/C++
Speed up the alignment part

Feature request: add `--translate` option to annotate genes

Annotation of gene catalogues is a common workflow during metagenomic analysis, therefore this function would be very useful.
eggNOG-mapper delivers a similar functionality for annotating proteins, but also allows for gene annotations. See translate_cds_to_prots in version 2.1.6 of eggNOG-mapper.
Link: https://github.com/eggnogdb/eggnog-mapper/blob/d6e6cdf0a829f2bd85480f3f3f16e38c213cd091/eggnogmapper/utils.py
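
To make the request concrete, here is a minimal standard-library sketch of translating a coding sequence with the standard genetic code (NCBI translation table 1). It is not eggNOG-mapper's `translate_cds_to_prots`; the name `translate_cds` and the `to_stop` parameter are assumptions for illustration.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna, to_stop=True):
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)
```

A real `--translate` option would also have to deal with alternative genetic codes and partial genes, which this sketch leaves out.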

mDeepFRI --help

Traceback (most recent call last):
  File "/home/oem/miniconda3/envs/deepfri/bin/mDeepFRI", line 33, in <module>
    sys.exit(load_entry_point('mDeepFRI==1.1.6', 'console_scripts', 'mDeepFRI')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/deepfri/bin/mDeepFRI", line 25, in importlib_load_entry_point
    return next(matches).load()
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/deepfri/lib/python3.11/importlib/metadata/__init__.py", line 198, in load
    module = import_module(match.group('module'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oem/miniconda3/envs/deepfri/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1149, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/oem/bioinformatics-tools/Metagenomic-DeepFRI/Metagenomic-DeepFRI/mDeepFRI/cli.py", line 11, in <module>
    from mDeepFRI.pipeline import (hierarchical_database_search,
  File "/home/oem/bioinformatics-tools/Metagenomic-DeepFRI/Metagenomic-DeepFRI/mDeepFRI/pipeline.py", line 14, in <module>
    from mDeepFRI.bio_utils import build_align_contact_map
  File "/home/oem/bioinformatics-tools/Metagenomic-DeepFRI/Metagenomic-DeepFRI/mDeepFRI/bio_utils.py", line 14, in <module>
    from mDeepFRI.alignment_utils import align_contact_map, pairwise_sqeuclidean
ModuleNotFoundError: No module named 'mDeepFRI.alignment_utils'

Indeed I do not see alignment_utils anywhere - I either see alignment module or bio_utils.

ll mDeepFRI
total 2696
drwxrwxr-x  4 oem oem    4096 cze  6 16:30 ./
drwxrwxr-x 11 oem oem    4096 cze  6 16:31 ../
-rw-rw-r--  1 oem oem    7706 cze  6 16:27 alignment.py
-rw-rw-r--  1 oem oem    9432 cze  6 16:27 bio_utils.py
-rw-rw-r--  1 oem oem   12613 cze  6 16:27 cli.py
-rw-rw-r--  1 oem oem 1196132 cze  6 16:30 contact_map_utils.cpp
-rw-rw-r--  1 oem oem    6421 cze  6 16:27 contact_map_utils.pyx
-rw-rw-r--  1 oem oem    3095 cze  6 16:27 database.py
-rw-rw-r--  1 oem oem    4194 cze  6 16:27 __init__.py
-rw-rw-r--  1 oem oem   21704 cze  6 16:27 mmseqs.py
-rw-rw-r--  1 oem oem    4591 cze  6 16:27 pdb.py
-rw-rw-r--  1 oem oem   13870 cze  6 16:27 pipeline.py
-rw-rw-r--  1 oem oem 1420033 cze  6 16:30 predict.cpp
-rw-rw-r--  1 oem oem    4379 cze  6 16:27 predict.pyx
drwxrwxr-x  2 oem oem    4096 cze  6 16:27 __pycache__/
drwxrwxr-x  3 oem oem    4096 cze  6 16:27 tests/
-rw-rw-r--  1 oem oem    7834 cze  6 16:27 utils.py

Is the repository up to date, or are you in the middle of some changes/development?

Thanks in advance :)

Error aligning proteins from `PDB`

Behaviour

PDB is a non-uniform resource with many data standardization problems.

[2024-03-21 11:16:12] bio_utils.retrieve_align_contact_map DEBUG: Aligning contact map for 3rv2_B against gnl_extdb_JC-0055tmp_000138_methionine_adenosyltransferase_[Microbacterium]
Traceback (most recent call last):
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/bio_utils.py", line 424, in retrieve_align_contact_map
    aligned_cmap = align_contact_map(alignment.gapped_sequence,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/bio_utils.py", line 364, in align_contact_map
    sparse_map = list(
                 ^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/bio_utils.py", line 367, in <lambda>
    (target_to_query_indices[x[0]], target_to_query_indices[x[1]]),
                                    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: 373

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/bin/mDeepFRI", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/cli.py", line 185, in predict_function
    predict_protein_function(
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/pipeline.py", line 158, in predict_protein_function
    cmap = retrieve_align_contact_map(aln, db.foldcomp_db,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/cds-peta/exports/biol_micro_cds_gr_sunagawa/scratch/vbezshapkin/conda-envs/deepfri/lib/python3.11/site-packages/mDeepFRI/bio_utils.py", line 430, in retrieve_align_contact_map
    raise ValueError(f"Error aligning contact map for {idx} against {alignment.query_name}\n"
ValueError: Error aligning contact map for 3rv2_B against gnl_extdb_JC-0055tmp_000138_methionine_adenosyltransferase_[Microbacterium]

Proposed solution

return empty alignment in case it fails - only happens for certain misformatted (?) proteins
add the flag --skip-pdb to CLI - skips PDB search completely
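
The first point could look roughly like the sketch below: wrap the alignment call, log the failure, and return `None` so the caller can fall back to the sequence-only model. `align_or_skip` and `index_contact` are hypothetical stand-ins, not the functions from `bio_utils.py`.

```python
import logging

logger = logging.getLogger("contact_map_sketch")


def align_or_skip(align_fn, target_id, query_id, *args):
    """Call align_fn(*args); on a lookup failure log a warning and return None."""
    try:
        return align_fn(*args)
    except (KeyError, IndexError, ValueError) as err:
        logger.warning(
            "Skipping contact map for %s against %s: %r", target_id, query_id, err
        )
        return None


def index_contact(index_map, contact):
    """Toy alignment step that fails like the traceback above on a missing index."""
    return (index_map[contact[0]], index_map[contact[1]])
```

Here `align_or_skip(index_contact, "3rv2_B", "query", {0: 0}, (0, 373))` returns `None` instead of aborting the whole run.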

Coupling `highquality_clust30` database from ESM

Currently the database cannot be effectively indexed with samtools-faidx because of inconsistent FASTA headers (see issue: steineggerlab/foldcomp#51).
A small hack using sed helps to glue the database together and correct the headers.

Not an optimal solution; it might break in the future.