shirapti-nath / genomic_benchmarks

This project forked from ml-bioinfo-ceitec/genomic_benchmarks

Benchmarks for classification of genomic sequences

License: Apache License 2.0

Genomic Benchmarks 🧬🏋️✔️

In this repository, we collect benchmarks for classification of genomic sequences. It is shipped as a Python package, together with functions helping to download & manipulate datasets and train NN models.

Hackathon 2021-11-19

We have collected a list of genomic datasets and are now organizing an ML hackathon to train classifiers over them. Would you join us on Friday, November 19, 2021, 15:00 CET at CEITEC MU, Brno, Czechia 🇨🇿🇪🇺, or remotely? Free refreshments for all participants, swag for the winners. The event is both competitive (to prove your ML models are the best) and a learning opportunity (we will provide all the help we can).

  • Final datasets and evaluation metrics will be provided on the day of the hackathon. In principle, they will be similar to datasets currently included in this package.
  • You can participate either in person at CEITEC or remotely. More information at bit.ly/genomichackathon, sign up here. No prior knowledge of DNA/RNA/genetics is needed (you must be able to code in Python and know ML basics).
  • To participate on-site, you must be vaccinated, recovered or tested (O-N-T regulations analogous to the German 3G rules apply). Please bring an FFP2 mask.

Install

Genomic Benchmarks can be installed as follows:

# maintained for and tested on Python version 3.8
git clone https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks.git
cd genomic_benchmarks
pip install --editable .

# if you want to train NN with TF
pip install "tensorflow>=2.6.0"  # quoted so the shell does not treat >= as a redirect
pip install typing-extensions --upgrade  # fixing TF installation issue

# if you want to train NN with torch
pip install "torch>=1.10.0"  # quoted so the shell does not treat >= as a redirect
pip install torchtext

For the package development, use Python 3.8 (ideally 3.8.9) and the environment described here.

Usage

Get the list of all datasets with the list_datasets function:

from genomic_benchmarks.data_check import list_datasets

print(list_datasets())

You can get basic information about a benchmark with the info function:

from genomic_benchmarks.data_check import info

info("human_nontata_promoters", version=0)

The function download_dataset downloads the full-sequence form of the requested benchmark (split into train and test sets, with one folder for each class). Unless specified otherwise, the data is stored in the .genomic_benchmarks subfolder of your home directory. By default, the dataset is obtained from our cloud cache (use_cloud_cache=True).

from genomic_benchmarks.loc2seq import download_dataset

download_dataset("human_nontata_promoters", version=0)
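To check what was downloaded, you can walk the folder layout described above (one folder per split, one subfolder per class). The following is a minimal sketch using only the standard library; the helper name count_sequences is hypothetical and not part of the package, and it assumes each sequence is stored as a separate file inside its class folder.

```python
from pathlib import Path


def count_sequences(dataset_dir):
    """Count files per split and class in a downloaded benchmark folder.

    Hypothetical helper, not part of genomic_benchmarks. Assumes the
    layout <dataset_dir>/<split>/<class>/<one file per sequence>.
    """
    counts = {}
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        return counts  # nothing downloaded at this location
    for split_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


# Default location mentioned above; adjust if you stored the data elsewhere.
dataset_dir = Path.home() / ".genomic_benchmarks" / "human_nontata_promoters"
print(count_sequences(dataset_dir))
```

If the folder does not exist yet, the helper returns an empty dictionary instead of raising an error.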

Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:

from pathlib import Path
import tensorflow as tf

BATCH_SIZE = 64
SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
CLASSES = ['negative', 'positive']

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    directory=SEQ_TRAIN_PATH,
    batch_size=BATCH_SIZE,
    class_names=CLASSES)

# show the first batch without loading the whole dataset into memory
print(next(iter(train_dset)))

See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.
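The samples are plain nucleotide strings, so they need a numeric encoding before a CNN can consume them. The sketch below shows one common option, one-hot encoding, in plain Python. It is only an illustration under stated assumptions: the alphabet A, C, G, T and the helper name one_hot are chosen here, and the notebook may use a different vectorization.

```python
NUCLEOTIDES = "ACGT"  # assumed alphabet; any other symbol (e.g. N) maps to all zeros


def one_hot(sequence):
    """Return a list of 4-element rows, one row per nucleotide.

    Hypothetical helper for illustration, not part of genomic_benchmarks.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


print(one_hot("ACGT"))
```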

Getting a PyTorch Dataset and displaying samples is also easy:

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters

dset = HumanNontataPromoters(split='train', version=0)
print(dset[0])

See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.
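Before training, it can be useful to check how balanced the classes are. The sketch below assumes that indexing the dataset returns a (sequence, label) pair and that the dataset supports len(); both are assumptions about the dataset object, and the helper name label_counts is hypothetical. A small list of made-up pairs stands in for the real dataset so the example runs without downloading anything.

```python
from collections import Counter


def label_counts(dataset):
    """Count how often each label occurs.

    Hypothetical helper; assumes dataset[i] returns (sequence, label)
    and that len(dataset) is defined.
    """
    counts = Counter()
    for i in range(len(dataset)):
        _sequence, label = dataset[i]
        counts[label] += 1
    return counts


# Toy stand-in for a real dataset object; the sequences are made up.
toy_dset = [("ACGTAC", 0), ("TTGACA", 1), ("GGCATT", 1)]
print(label_counts(toy_dset))
```

With the real data, the same call would be label_counts(dset), provided the assumptions above hold for that dataset class.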

Introduction

[WHY ARE BENCHMARKS IMPORTANT?]

[WHAT BENCHMARKS ARE GENOMIC BENCHMARKS?]

Structure of package

  • datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see README.md for the format specification
  • docs: Each folder contains a Python notebook that has been used for the dataset creation
  • experiments: Training a simple neural network model(s) for each benchmark dataset, can be used as a baseline
  • notebooks: Main use-cases demonstrated in a form of Jupyter notebooks
  • src/genomic_benchmarks: Python module for dataset manipulation (downloading, checking, etc.)
  • tests: Unit tests for pytest and pytest-cov

How to contribute

TBD
