
ml-bioinfo-ceitec / genomic_benchmarks

Benchmarks for classification of genomic sequences

License: Apache License 2.0

Languages: Jupyter Notebook 92.54%, Python 7.41%, Shell 0.05%
Topics: deep-learning, dataset, genomics, genomics-data, tensorflow, pytorch

genomic_benchmarks's Introduction

Genomic Benchmarks 🧬🏋️✔️

In this repository, we collect benchmarks for the classification of genomic sequences. It is shipped as a Python package, together with functions for downloading & manipulating datasets and training NN models. The current SOTA model on Genomic Benchmarks is HyenaDNA; see the metrics in the experiments folder.

Install

Genomic Benchmarks can be installed as follows:

pip install genomic-benchmarks

To use it with papermill, TensorFlow, or PyTorch, install the corresponding dependencies:

# if you want to use jupyter and papermill
pip install 'jupyter>=1.0.0'
pip install 'papermill>=2.3.0'

# if you want to train NNs with TensorFlow
pip install 'tensorflow>=2.6.0'
pip install tensorflow-addons
pip install --upgrade typing-extensions  # fixes a TF installation issue

# if you want to train NNs with PyTorch
pip install 'torch>=1.10.0'
pip install torchtext

For package development, use Python 3.8 (ideally 3.8.9) and the installation described here.

Usage

Get the list of all datasets with the list_datasets function:

>>> from genomic_benchmarks.data_check import list_datasets
>>> 
>>> list_datasets()
['demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'dummy_mouse_enhancers_ensembl', 'human_enhancers_cohn', 'human_enhancers_ensembl', 'human_ensembl_regulatory',  'human_nontata_promoters', 'human_ocr_ensembl']

You can get basic information about a benchmark with the info function:

>>> from genomic_benchmarks.data_check import info
>>> 
>>> info("human_nontata_promoters", version=0)
Dataset `human_nontata_promoters` has 2 classes: negative, positive.

All lenghts of genomic intervals equals 251.

Totally 36131 sequences have been found, 27097 for training and 9034 for testing.
          train  test
negative  12355  4119
positive  14742  4915

The function download_dataset downloads the full-sequence form of the required benchmark (split into train and test sets, with one folder per class). Unless specified otherwise, the data are stored in the .genomic_benchmarks subfolder of your home directory. By default, the dataset is obtained from our cloud cache (use_cloud_cache=True).

>>> from genomic_benchmarks.loc2seq import download_dataset
>>> 
>>> download_dataset("human_nontata_promoters", version=0)
Downloading 1VdUg0Zu8yfLS6QesBXwGz1PIQrTW3Ze4 into /home/petr/.genomic_benchmarks/human_nontata_promoters.zip... Done.
Unzipping...Done.
PosixPath('/home/petr/.genomic_benchmarks/human_nontata_promoters')
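
If the cloud cache is unavailable, or you want to rebuild the dataset locally, you can turn it off; the sequences are then recreated from the reference genome, which is slower because the reference must be downloaded first:

>>> download_dataset("human_nontata_promoters", version=0, use_cloud_cache=False)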

Getting a TensorFlow Dataset for the benchmark and displaying samples is straightforward:

>>> from pathlib import Path
>>> import tensorflow as tf
>>> 
>>> BATCH_SIZE = 64
>>> SEQ_TRAIN_PATH = Path.home() / '.genomic_benchmarks' / 'human_nontata_promoters' / 'train'
>>> CLASSES = ['negative', 'positive']
>>> 
>>> train_dset = tf.keras.preprocessing.text_dataset_from_directory(
...     directory=SEQ_TRAIN_PATH,
...     batch_size=BATCH_SIZE,
...     class_names=CLASSES)
Found 27097 files belonging to 2 classes.
>>> 
>>> list(train_dset)[0][0][0]
<tf.Tensor: shape=(), dtype=string, numpy=b'TCCTGCCTTTCCACTTGCACCAGTTTTCCCACCCCAGCCTCAGGGCGGGGCTGCCTCGTCACTTGTCTCGGGGCAGATCTGCCCTACACACGTTAGCGCCGCGCGCAAAGCAGCCCCGCAGCACCCAGGCGCCTCCTGGCGGCGCCGCGAAGGGGCGGGGCTGTCGGCTGCGCGTTGTGCGCTGTCCCAGGTTGGAAACCAGTGCCCCAGGCGGCGAGGAGAGCGGTGCCTTGCAGGGATGCTGCGGGCGG'>
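
From here, a classifier can be trained directly on the string tensors. Below is a minimal sketch, not the model from the notebook referenced next; it assumes a recent TF version (>= 2.7) so that TextVectorization supports split="character":

import tensorflow as tf

# Character-level tokenization: each nucleotide becomes a small integer id.
vectorize_layer = tf.keras.layers.TextVectorization(split="character", output_mode="int")
vectorize_layer.adapt(train_dset.map(lambda x, y: x))

model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=8, output_dim=16),  # a/c/g/t/n + padding/OOV tokens
    tf.keras.layers.Conv1D(filters=32, kernel_size=8, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary: negative vs. positive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_dset, epochs=3)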

See How_To_Train_CNN_Classifier_With_TF.ipynb for a more detailed description of how to train a CNN classifier with TensorFlow.

Getting a PyTorch Dataset and displaying samples is also easy:

>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters
>>> 
>>> dset = HumanNontataPromoters(split='train', version=0)
>>> dset[0]
('CAATCTCACAGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGGCAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCACATGGCAAGAAGGTGCTGACTTCCTTGGGAGATGCCATAAAGCACCTGGATGATCTCAAGGGCACCTTTGCCCAGCTGAGTGAACTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGGTGAGTCCAGGAGATGT', 0)
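
To feed these string samples into a network, they must first be numericalized. Here is a minimal sketch with a hand-rolled character vocabulary (an illustrative assumption; the notebook referenced next uses torchtext instead):

import torch
from torch.utils.data import DataLoader

VOCAB = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 5}  # illustrative mapping; 0 is reserved for padding

def collate(batch):
    seqs, labels = zip(*batch)
    encoded = [torch.tensor([VOCAB.get(ch, 5) for ch in seq]) for seq in seqs]
    x = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True)  # (batch, max_len)
    y = torch.tensor(labels, dtype=torch.float32)
    return x, y

loader = DataLoader(dset, batch_size=64, shuffle=True, collate_fn=collate)
x, y = next(iter(loader))  # x: (64, 251) int64 tensor, y: (64,) float labels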

See How_To_Train_CNN_Classifier_With_Pytorch.ipynb for a more detailed description of how to train a CNN classifier with PyTorch.

Hugging Face

We also provide these benchmarks through the Hugging Face Hub: https://huggingface.co/katarinagresova

If you are used to working with Hugging Face datasets, you can use this option to access Genomic Benchmarks. See How_To_Use_Datasets_From_HF.ipynb for a guide.
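
For example, with the Hugging Face datasets library (the dataset id below is an assumption following the hub's naming pattern; verify the exact name on the hub page above):

from datasets import load_dataset

# Hypothetical dataset id; check https://huggingface.co/katarinagresova for the exact name.
dset = load_dataset("katarinagresova/Genomic_Benchmarks_human_nontata_promoters")
print(dset["train"][0])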

Structure of package

  • datasets: Each folder is one benchmark dataset (or a set of benchmarks in subfolders); see README.md for the format specification
  • docs: Each folder contains a Python notebook that was used for the dataset creation
  • experiments: Training of a simple neural network model for each benchmark dataset; can be used as baselines
  • notebooks: Main use-cases demonstrated in the form of Jupyter notebooks
  • src/genomic_benchmarks: Python module for dataset manipulation (downloading, checking, etc.)
  • tests: Unit tests for pytest and pytest-cov

How to contribute

How to contribute a model

If you beat our current best model on any dataset or just came up with an interesting new idea, let us know about it: make your code publicly available (GitHub repo, Colab...) and fill in the form at

https://forms.gle/pvkkrgHNCNmAAC1TA

How to contribute a dataset

If you have an interesting genomic dataset, send us an issue with a description and, ideally, a link to the data (e.g. a BED file and a FASTA reference). In the future, we will provide functions to make the import easy.

If you are a hero, read the specification of our dataset format and send us a pull request adding datasets/[YOUR_DATASET_NAME] and docs/[YOUR_DATASET_NAME] folders.

How to improve code in this package

We welcome new code contributors. If you see a bug, send us an issue with a minimal reproducible example. Or even better, fix the bug and send us a pull request.

Citing Genomic Benchmarks

If you use Genomic Benchmarks in your research, please cite it as follows.

Text

Grešová, Katarína, et al. "Genomic benchmarks: a collection of datasets for genomic sequence classification." BMC Genomic Data 24.1 (2023): 25.

BibTeX

@article{grevsova2023genomic,
  title={Genomic benchmarks: a collection of datasets for genomic sequence classification},
  author={Gre{\v{s}}ov{\'a}, Katar{\'\i}na and Martinek, Vlastimil and {\v{C}}ech{\'a}k, David and {\v{S}}ime{\v{c}}ek, Petr and Alexiou, Panagiotis},
  journal={BMC Genomic Data},
  volume={24},
  number={1},
  pages={25},
  year={2023},
  publisher={Springer}
}

genomic_benchmarks's People

Contributors

davidcechak, katarinagresova, martinekv, simecek

genomic_benchmarks's Issues

Add explicit numericalization of labels

Currently, the numericalization of labels depends on the order of the folders in the filesystem. A possible improvement would be to explicitly define a file with the mapping information, for example:

{
  'positive': 0,
  'negative': 1,
}

Alternatively, the order of classes in each dataset's metadata.yaml file could be used.
This would allow the user to explicitly filter the data by label and would make the numericalization more consistent.

The following code snippets rely on the order of folders to label samples.

def labels_in_order(dset_name):
    dir_path = CACHE_PATH / dset_name
    true_labels = []
    for i, label_path in enumerate(Path(dir_path / "test").iterdir()):
        for j in label_path.iterdir():
            true_labels.append(i)
    return true_labels

for i, x in enumerate(base_path.iterdir()):
    label_mapper[x.stem] = i

TensorFlow notebook demo ⬇️

CLASSES = [x.stem for x in (SEQ_PATH/'train').iterdir() if x.is_dir()]

train_dset = tf.keras.preprocessing.text_dataset_from_directory(
    SEQ_PATH / 'train',
    batch_size=BATCH_SIZE,
    class_names=CLASSES)
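
A possible interim workaround (a sketch, not part of the package) is to sort the class folders so that the mapping no longer depends on filesystem iteration order:

# Sort folder names alphabetically for a deterministic label mapping.
CLASSES = sorted(x.stem for x in (SEQ_PATH / 'train').iterdir() if x.is_dir())
label_mapper = {name: i for i, name in enumerate(CLASSES)}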

Masked DNA strings

Some DNA strings in the datasets partially or entirely consist of masked bases; e.g., the 7th sequence in the DemoHumanOrWorm training set (checked via dset[6]) is a string of 'NNNNNNN....NNNN'. Maybe consider extracting the DNA strings from the unmasked genome?
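
A quick sketch for counting fully masked sequences with the PyTorch getter (assuming the dataset yields (sequence, label) pairs and supports len(), as the examples above suggest):

from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm

dset = DemoHumanOrWorm(split='train', version=0)
# Count sequences that consist only of the masked base 'N'.
n_masked = sum(1 for seq, _ in dset if set(seq) == {"N"})
print(n_masked, "of", len(dset), "sequences are fully masked")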

Download of HumanEnhancersEnsembl (PyTorch) dataset is not working

The following code:

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersEnsembl

X_train = HumanEnhancersEnsembl(split="train", version=0)

produces the error:

Downloading 1gZBEV_RGxJE8EON5OObdrp5Tp8JL0Fxb into /root/.genomic_benchmarks/human_enhancers_ensembl.zip... Done.
Unzipping...

/usr/local/lib/python3.7/dist-packages/google_drive_downloader/google_drive_downloader.py:78: UserWarning: Ignoring `unzip` since "1gZBEV_RGxJE8EON5OObdrp5Tp8JL0Fxb" does not look like a valid zip file
  warnings.warn('Ignoring `unzip` since "{}" does not look like a valid zip file'.format(file_id))

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-18-ed514f54c2bb> in <module>()
      1 from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersEnsembl
      2 
----> 3 X_train = HumanEnhancersEnsembl(split="train", version=0)

2 frames

/usr/local/lib/python3.7/dist-packages/genomic_benchmarks/dataset_getters/pytorch_datasets.py in HumanEnhancersEnsembl(split, force_download, version)
     77 
     78 def HumanEnhancersEnsembl(split, force_download=False, version=None):
---> 79     return GenomicClfDataset("human_enhancers_ensembl", split, force_download, version)
     80 
     81 

/usr/local/lib/python3.7/dist-packages/genomic_benchmarks/dataset_getters/pytorch_datasets.py in __init__(self, dset_name, split, force_download, version)
     36         label_mapper = {}
     37 
---> 38         for i, x in enumerate(base_path.iterdir()):
     39             label_mapper[x.stem] = i
     40 

/usr/lib/python3.7/pathlib.py in iterdir(self)
   1105         if self._closed:
   1106             self._raise_closed()
-> 1107         for name in self._accessor.listdir(self):
   1108             if name in {'.', '..'}:
   1109                 # Yielding a path object for these makes little sense

FileNotFoundError: [Errno 2] No such file or directory: '/root/.genomic_benchmarks/human_enhancers_ensembl/train'

The same problem happens for split="test".

Specifically, the unzipping problem also occurs with this code:

from genomic_benchmarks.loc2seq import download_dataset
download_dataset("human_enhancers_ensembl", version=0)

However, when I set use_cloud_cache=False, the dataset is downloaded (the default value is True).
So it seems there is some problem with the cloud cache for this dataset.

Downloading other datasets works fine.

Datasets not found

I have installed this package, but I can't load the datasets.

My code is as follows:

from genomic_benchmarks.data_check import list_datasets
from genomic_benchmarks.dataset_getters.pytorch_datasets import get_dataset
from genomic_benchmarks.data_check import info
from genomic_benchmarks.loc2seq import download_dataset

When trying to download, for example, 'demo_coding_vs_intergenomic_seqs', I get FileNotFoundError: Dataset demo_coding_vs_intergenomic_seqs not found.

For completeness' sake, I wrote code to attempt to download each of the datasets.

for dset in list_datasets():
    try:
        get_dataset(dset, split='train')
        print("success!")
    except:
        print(dset, "not found")

The output is as follows:

demo_coding_vs_intergenomic_seqs not found
human_enhancers_cohn not found
human_ocr_ensembl not found
demo_human_or_worm not found
human_ensembl_regulatory not found
drosophila_enhancers_stark not found
dummy_mouse_enhancers_ensembl not found
human_enhancers_ensembl not found
human_nontata_promoters not found

The same occurs with the info and download_dataset functions as well. Any help on what I'm doing wrong would be appreciated.

Enhancers_cohn labels incorrect?

Hi, I was wondering, is there a chance the negative and positive examples are actually switched around by accident?

I trained a model and got a little above what was reported in the paper. However, when I applied the model to my own sequences that were non-enhancers (and negatives), I got the opposite predictions: pretty much exactly the same performance as during training, but flipped.

For example, I get 70% during training on the Genomic Benchmarks dataset. Taking that model and predicting on my own enhancer sequences, I get 70% if I actually switch the labels. Conversely, I get 30% when I use the labels as provided in the Genomic Benchmarks dataset.

Any thoughts? Thank you.

Info on full-set dataset

Currently, info provides information based only on the interval-type format of the dataset. It would be useful if info also provided a summary of the sequences (like GC content).
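
Until then, a rough sketch of computing mean GC content from a downloaded full-sequence dataset, using the PyTorch getter as an example:

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanNontataPromoters

dset = HumanNontataPromoters(split='train', version=0)
# Fraction of G/C bases per sequence, averaged over the training set.
gc = [(seq.count("G") + seq.count("C")) / len(seq) for seq, _ in dset]
print(f"mean GC content: {sum(gc) / len(gc):.3f}")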

Longer intervals?

Hi, thanks so much for providing this dataset! I will definitely cite your work :)

I was wondering, is there any way to increase the intervals of the sequences? If not, I don't suppose you know of other datasets that may have longer sequences? I'd like to test my new model that can capture longer contexts (sequences). Thanks so much!

Eric

Add a possibility to create a custom dataset

Given a BED file or several BED files, provide a function that converts them into an interval-type dataset (i.e. convert the BED file into a gzipped CSV file and do a train/test split). Optionally, randomly generate negative controls.
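
A minimal sketch of such a conversion using pandas and scikit-learn (the file names and output layout are illustrative assumptions; the authoritative format is specified in the datasets README):

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the first three BED columns: chromosome, start, end.
df = pd.read_csv("positives.bed", sep="\t", header=None,
                 usecols=[0, 1, 2], names=["chr", "start", "end"])

# 80/20 train/test split, then write gzipped CSVs.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv("train_positive.csv.gz", index=False, compression="gzip")
test.to_csv("test_positive.csv.gz", index=False, compression="gzip")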

Force download doesn't work for pytorch datasets

The force_download parameter in PyTorch datasets does not force the dataset to be re-downloaded.

from genomic_benchmarks.dataset_getters.pytorch_datasets import HumanEnhancersCohn

train_dset = HumanEnhancersCohn('train', version=0, force_download=True)

The cause is probably that the code first checks whether the dataset is already downloaded, and only after that check is the force_download parameter considered. This makes force_download irrelevant when the dataset is already downloaded.

if not is_downloaded(dset_name):
    download_dataset(dset_name, version=version, force_download=force_download)
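
A possible one-line fix (a sketch) is to fold force_download into the check itself:

# Re-download whenever force_download is set, even if a copy already exists.
if force_download or not is_downloaded(dset_name):
    download_dataset(dset_name, version=version, force_download=force_download)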

Leaderboard

Hi, in the paper you mentioned there would be a public leaderboard for the best results. I was wondering if I missed it somewhere, or if it's not up yet?

I have surpassed one of the benchmarks so far; I'm just curious if there's a way to check the state of the art. Thanks!

Eric

cloud_cache is not working

>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import get_dataset
>>> train_dset = get_dataset('demo_human_or_worm', 'train')
Downloading 1Vuc44bXRISqRDXNrxt5lGYLpLsJbrSg8 into /root/.genomic_benchmarks/demo_human_or_worm.zip... /usr/local/lib/python3.7/dist-packages/genomic_benchmarks/utils/datasets.py:50: UserWarning: No version specified. Using version 0.
  warnings.warn(f"No version specified. Using version {metadata['version']}.")
Done.
Unzipping.../usr/local/lib/python3.7/dist-packages/google_drive_downloader/google_drive_downloader.py:78: UserWarning: Ignoring `unzip` since "1Vuc44bXRISqRDXNrxt5lGYLpLsJbrSg8" does not look like a valid zip file
  warnings.warn('Ignoring `unzip` since "{}" does not look like a valid zip file'.format(file_id))

Permission Error on HumanOrWorm Dataset

This should be a simple fix: updating the sharing permission of the human-or-worm file to 'Anyone with the link'.

>>> from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm
>>> dset = DemoHumanOrWorm(split='train')
Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=1JW0-eTB-rJXvFcglqBo3pFZi1kyIWC3X 

download_dataset doesn't work as expected on my CHPC system.

I was able to get download_dataset to work as expected on my MacBook Pro, but when I try to use it on my university's CHPC system, I get the error below.

Do you have any idea what could be causing this? I am trying to repeat the experiments from the HyenaDNA paper, and their code depends on this function working properly.

(p100_hyena-dna) [u1323098@kp359:test_dir]$ python
Python 3.8.18 | packaged by conda-forge | (default, Oct 10 2023, 15:44:36)
[GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from genomic_benchmarks.data_check import list_datasets
>>> list_datasets()
['drosophila_enhancers_stark', 'dummy_mouse_enhancers_ensembl', 'human_ensembl_regulatory', 'demo_coding_vs_intergenomic_seqs', 'demo_human_or_worm', 'human_nontata_promoters', 'human_enhancers_ensembl', 'human_enhancers_cohn', 'human_ocr_ensembl']
>>> from genomic_benchmarks.loc2seq import download_dataset
>>> download_dataset("human_nontata_promoters", version=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/genomic_benchmarks/loc2seq/loc2seq.py", line 55, in download_dataset
    return download_from_cloud_cache((dataset_name, version), Path(dest_path) / dataset_name)
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/genomic_benchmarks/loc2seq/cloud_caching.py", line 32, in download_from_cloud_cache
    gdown.download(
  File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/gdown/download.py", line 259, in download
    filename_from_url = m.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'

More diverse datasets needed

Currently, all datasets are binary (2 categories) and either balanced or almost balanced. It would be great to include:

  • a dataset that has more than two categories
  • a dataset that is unbalanced
