Giter Site home page Giter Site logo

bio-ontology-research-group / pu-go Goto Github PK

View Code? Open in Web Editor NEW
2.0 11.0 1.0 952 KB

Predicting protein functions using positive-unlabeled ranking with ontology-based priors

Python 98.73% Shell 1.27%
positive-unlabeled-learning protein-function-prediction

pu-go's Introduction

Predicting protein functions using positive-unlabeled ranking with ontology-based priors

Abstract

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the \emph{false negative} issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled classification problem. We apply empirical risk minimization, i.e., we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets

Dependencies

  • Python 3.10
  • PyTorch 2.1.0
  • FAIR-ESM (for predicting over FASTA files)
  • Other basic dependencies can be installed by running: conda env create -f environment.yml
  • Install diamond program on your system (diamond command should be available)

Data Availability

cd PU-GO
wget https://bio2vec.cbrc.kaust.edu.sa/data/pugo/pu-go-data.tar.gz
tar -xzvf pu-go-data.tar.gz
  • For each subontology directory [mf, cc, bp] run the following command to extract the 10 trained models:
cd data/mf
tar -xzvf models.tar.gz

Scripts

  • pu_go.py script to train/test PU-GO on the similarity-based split. We show the commands with the selected hyperparameters for each subontology.

    • MFO: python pu_go.py -dr data/ -ont mf --run 0 --batch_size 35 --loss_type pu_ranking_multi --margin_factor 0.01152 --max_lr 0.004573 --min_lr_factor 0.0008792 --prior 0.000143 -ld
    • CCO: python pu_go.py -dr data/ -ont cc --run 0 --batch_size 30 --loss_type pu_ranking_multi --margin_factor 0.09812 --max_lr 0.0008681 --min_lr_factor 0.08364 --prior 0.0001106 -ld
    • BPO: python pu_go.py -dr data/ -ont bp --run 0 --batch_size 39 --loss_type pu_ranking_multi --margin_factor 0.02457 --max_lr 0.0004305 --min_lr_factor 0.09056 --prior 0.0007845 -ld
  • pu_go_time.py script to train PU-GO on the time-based split.

  • predict.py script to predict functions given a FASTA file.

    • Usage: python predict.py -if your_fasta_file.fa -d cuda

Note: We used the ESM2-15B model using a NVIDIA A100 GPU.

Citation (TODO)

pu-go's People

Contributors

coolmaksat avatar ferzcam avatar lilv98 avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

ferzcam

pu-go's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.