conda create -n gfm python=3.11 -y && conda activate gfm
# Install PyTorch (adjust the index URL to match your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install DNABERT
cd DNABERT && pip install -e . && cd -
# Install the rest of the dependencies
pip install -r requirements.txt -U
All processed data can be downloaded from Zenodo and unpacked into the data/ directory:
wget "https://zenodo.org/records/10701018/files/processed.zip?download=1" -O data/processed.zip
unzip data/processed.zip -d data/
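Where `wget` or `unzip` is not available, the same download and extraction can be done from Python. The following is a minimal sketch using only the standard library; the helper names `download` and `extract` are hypothetical and not part of this repository, and the integrity check is an extra step not present in the commands above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command above.
ZENODO_URL = "https://zenodo.org/records/10701018/files/processed.zip?download=1"


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest`, creating parent directories if needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def extract(zip_path: Path, out_dir: Path) -> list[str]:
    """Check the archive for corrupt members, then extract it into `out_dir`."""
    with zipfile.ZipFile(zip_path) as zf:
        bad_member = zf.testzip()  # returns the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        zf.extractall(out_dir)
        return zf.namelist()


if __name__ == "__main__":
    archive = download(ZENODO_URL, Path("data/processed.zip"))
    extract(archive, Path("data"))
```

Running the script from the repository root mirrors the two shell commands: the archive lands in `data/processed.zip` and its contents are unpacked into `data/`.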
If you would like to process all data from scratch, start by running the two notebooks listed below, which download the DHS data and extract exclusive peaks along with their corresponding class labels. Together, the two notebooks can take up to an hour to run, depending on your internet connection and compute resources.
notebooks/master_dataset.ipynb
notebooks/filter_master.ipynb
Next, run the processing script to convert the sequences into k-mer features and other numerical features.
sh run/prepare_data.sh
sh run/run_all_exps.sh
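For readers unfamiliar with the k-mer features mentioned in the data preparation step, the sketch below shows the general idea: slide a window of length k over a DNA sequence and count each k-mer over a fixed A/C/G/T vocabulary. It is an illustration only, not the code in `run/prepare_data.sh`; the function names, the default k, and the choice to ignore k-mers containing ambiguous bases such as N are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_tokens(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (hypothetical helper)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_counts(seq: str, k: int = 3) -> list[int]:
    """Return a count vector over all 4**k k-mers in lexicographic order.

    k-mers containing characters other than A, C, G, T (for example N)
    are not in the vocabulary and are therefore ignored (an assumption).
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokens(seq, k))
    return [counts.get(kmer, 0) for kmer in vocab]


if __name__ == "__main__":
    print(kmer_tokens("ACGTAC", k=3))       # overlapping 3-mers
    print(len(kmer_counts("ACGTAC", k=3)))  # vocabulary size, 4**3
```

A fixed-vocabulary count vector like this gives every sequence the same feature dimension regardless of its length, which is what makes it usable as input to standard classifiers.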