Giter Site home page Giter Site logo

genomenlp's Introduction

DOI Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

genomeNLP: Genome recoding for Machine Learning Usage incorporating genomicBERT

NOTE: The main repository is on github but is also mirrored on gitlab. Please submit any issues to the main github repository only.

Copyright (c) 2022 Tyrone Chen ORCID logo, Navya Tyagi ORCID logo, Sarthak Chauhan, Anton Y. Peleg ORCID logo, and Sonika Tyagi ORCID logo.

Code in this repository is provided under a MIT license. This documentation is provided under a CC-BY-3.0 AU license.

Visit our lab website here. Contact Sonika Tyagi at [email protected].

Highlights

  • We provide a comprehensive classification of genomic data tokenisation and representation approaches for ML applications along with their pros and cons.
  • Using our genomicBERT deep learning pipeline, we infer k-mers directly from the data and handle out-of-vocabulary words. At the same time, we achieve a significantly reduced vocabulary size compared to the conventional k-mer approach reducing the computational complexity drastically.
  • Our method is agnostic to species or biomolecule type as it is data-driven.
  • We enable comparison of trained model performance without requiring original input data, metadata or hyperparameter settings.
  • We present the first publicly available, high-level toolkit that infers the grammar of genomic data directly through artificial neural networks.
  • Preprocessing, hyperparameter sweeps, cross validations, metrics and interactive visualisations are automated but can be adjusted by the user as needed.

graphical abstract describing the repository

Cite us with:

Will be provided on publication (currently in review)

A preprint is available, which will be replaced by the publication once online.

Cite our manuscript here:

@article{chen2023genomicbert,
    title={genomicBERT and data-free deep-learning model evaluation},
    author={Chen, Tyrone and Tyagi, Navya and Chauhan, Sarthak and Peleg, Anton Y and Tyagi, Sonika},
    journal={bioRxiv},
    month={jun},
    pages={2023--05},
    year={2023},
    publisher={Cold Spring Harbor Laboratory},
    doi={10.1101/2023.05.31.542682},
    url={https://doi.org/10.1101/2023.05.31.542682}
}

Cite our software here:

@software{tyrone_chen_2023_8135591,
    author       = {Tyrone Chen and
                    Navya Tyagi and
                    Sarthak Chauhan and
                    Anton Y. Peleg and
                    Sonika Tyagi},
    title        = {{genomicBERT and data-free deep-learning model 
                    evaluation}},
    month        = jul,
    year         = 2023,
    publisher    = {Zenodo},
    version      = {latest},
    doi          = {10.5281/zenodo.8135590},
    url          = {https://doi.org/10.5281/zenodo.8135590}
}

Install

Mamba (automated)

This is the recommended install method as it automatically handles dependencies. Note that this has only been tested on a linux operating system. Remember to include the conda-forge channel during install or in your anaconda configuration.

NOTE: Installing with mamba is highly recommended. Installing with pip will not work. Installing with conda will be slow. You can find instructions for setting up mamba here.

First try this:

mamba install -c tyronechen -c conda-forge genomenlp

If there are any errors with the previous step (especially if you are on a cluster with GPU access), try this first and then repeat the previous step:

mamba install -c anaconda cudatoolkit

If neither works, please submit an issue with the full stack trace and any supporting information.

Mamba (manual)

Clone the git repository. This will also allow you to manually run the python scripts.

Then manually install the following dependencies with mamba. Installing with pip will not work as some distributions are not available on pip.:

datasets==2.10.1
gensim==4.2.0
hyperopt==0.2.7
matplotlib==3.5.2
pandas==1.4.2
pytorch==1.10.0
ray-default==1.13.0
scikit-learn==1.1.1
scipy==1.10.1
screed==1.0.5
seaborn==0.11.2
sentencepiece==0.1.96
tabulate==0.9.0
tokenizers==0.12.1
tqdm==4.64.0
transformers==4.23.0
transformers-interpret==0.8.1
wandb==0.13.4
weightwatcher==0.6.4
xgboost==1.7.1
yellowbrick==1.3.post1

You should then be able to run the scripts manually from src/genomenlp. As with the automated step, cudatoolkit may be required.

Usage

Please refer to the documentation for detailed usage information of the package and the genomicBERT pipeline.

Acknowledgements

TC was supported by an Australian Government Research Training Program (RTP) Scholarship and Monash Faculty of Science Dean’s Postgraduate Research Scholarship. ST acknowledges support from Early Mid-Career Fellowship by Australian Academy of Science and Australian Women Research Success Grant at Monash University. AP and ST acnowledge MRFF funding for the SuperbugAI flagship. This work was supported by the MASSIVE HPC facility and the authors thank the Monash Bioinformatics Platform as well as the HPC team at Monash eResearch Centre for their continuous personnel support. We thank Yashpal Ramakrishnaiah for helpful suggestions on package management, code architecture and documentation hosting. We thank Jane Hawkey for advice on recovering deprecated bacterial protein identifier mappings in NCBI. We thank Andrew Perry and Richard Lupat for helping resolve an issue with the python package building process. Biorender was used to create many figures in this publication. We acknowledge and pay respects to the Elders and Traditional Owners of the land on which our 4 Australian campuses stand.

genomenlp's People

Contributors

biotod avatar navya003 avatar sarthak99204 avatar stepwise-ai-dev avatar tsonika avatar tyronechen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.