scottclowe / barcodebert

This project forked from kari-genomics-lab/barcodebert

A pre-trained representation from a transformers model for inference on insect DNA barcoding data.

License: MIT License

Shell 1.75% Python 98.25%

BarcodeBERT

A pre-trained transformer model for inference on insect DNA barcoding data.

Model weights

4-mers
5-mers
6-mers

Reproducing the results from the paper

  1. Clone this repository and install the required libraries by running:
pip install -e .
  2. Download the data:
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
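
After unpacking, it can help to confirm that the files the later commands read are actually in place. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the only file it checks by default is `pre_training.tsv`, since that is the only data file named in this README. Extend the list once you know what the archive contains.

```python
from pathlib import Path


def missing_files(data_dir, expected=("pre_training.tsv",)):
    """Return the expected file names that are absent from data_dir.

    The default list is an assumption based on the one file this README names.
    """
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("data/ contains the expected files.")
```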

CNN model

Training:

cd scripts/CNN/
python 1D_CNN_supervised.py

Evaluation:

python 1D_CNN_genus.py
python 1D_CNN_Linear_probing.py

BarcodeBERT

Model Pretraining:

cd scripts/BarcodeBERT/
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=4 --stride=4
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=5 --stride=5
python MGPU_MLM_train.py --input_path=../../data/pre_training.tsv --k_mer=6 --stride=6
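
The `--k_mer` and `--stride` flags above are set to the same value, which suggests that each barcode is split into non-overlapping k-mers before being fed to the model. The sketch below illustrates that idea only; `kmer_tokenize` is a hypothetical helper, not the repository's tokenizer, and it simply drops trailing bases that do not fill a complete k-mer, whereas the training script may pad or handle them differently.

```python
def kmer_tokenize(sequence, k, stride):
    """Split a DNA sequence into k-mers taken every `stride` bases.

    Illustrative only: trailing bases that do not fill a whole k-mer are dropped.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokenize("ACGTACGTACGT", k=4, stride=4)
print(tokens)  # ['ACGT', 'ACGT', 'ACGT']
```

With `stride` equal to `k` the tokens do not overlap; a smaller stride would produce overlapping k-mers from the same sequence.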

Evaluation:

python MLM_genus_test.py 4
python MLM_genus_test.py 5
python MLM_genus_test.py 6

python Linear_probing.py 4
python Linear_probing.py 5
python Linear_probing.py 6

Model Fine-tuning:

To fine-tune the model, you need a folder containing three files: "train", "test", and "dev". Each file should have two columns, one named "sequence" and the other named "label". You also need to pass the path to the pre-trained checkpoint you want to fine-tune, using the --Pretrained_checkpoint_path argument.

python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=4 --stride=4
python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=5 --stride=5
python Fine_tuning.py --input_path=path_to_the_input_folder --Pretrained_checkpoint_path path_to_the_pretrained_model --k_mer=6 --stride=6
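
As a rough illustration of the input folder described above, the sketch below writes "train", "test", and "dev" files, each with a "sequence" and a "label" column. Several details are assumptions and are not confirmed by this README: the `.tsv` suffix, the tab delimiter, the helper name `write_split`, the output folder name, and the toy rows. Adjust them to whatever the fine-tuning script actually reads.

```python
import csv
from pathlib import Path


def write_split(folder, name, rows, suffix=".tsv", delimiter="\t"):
    """Write (sequence, label) pairs to folder/name+suffix with a header row.

    suffix and delimiter are assumptions; match them to the fine-tuning script.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}{suffix}"
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(["sequence", "label"])
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    # Made-up rows for illustration only; real splits should not share data.
    toy_rows = [("ACGTACGTAC", "genus_a"), ("TTGACCGTAA", "genus_b")]
    for split in ("train", "test", "dev"):
        print(write_split("finetune_input", split, toy_rows))
```

The resulting folder would then be the value passed to `--input_path`.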

DNABERT

To fine-tune the model on our data, first follow the instructions in the original DNABERT repository to download the model weights. Place them in the dnabert folder and then run the following:

cd scripts/DNABERT/
python supervised_learning.py --input_path=../../data -k 4 --model dnabert --checkpoint dnabert/4-new-12w-0
python supervised_learning.py --input_path=../../data -k 5 --model dnabert --checkpoint dnabert/5-new-12w-0
python supervised_learning.py --input_path=../../data -k 6 --model dnabert --checkpoint dnabert/6-new-12w-0

DNABERT-2

To fine-tune the model on our dataset, follow the instructions in the DNABERT2 repository for fine-tuning the model on a new dataset. You can use the same input path as for fine-tuning BarcodeBERT.

Citation

If you find BarcodeBERT useful in your research, please consider citing:

@misc{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias
    and Niousha Sadjadi
    and Monireh Safari
    and ZeMing Gong
    and Austin T. Wang
    and Scott C. Lowe
    and Joakim Bruslund Haurum
    and Iuliia Zarubiieva
    and Dirk Steinke
    and Lila Kari
    and Angel X. Chang
    and Graham W. Taylor
  },
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}
