
TransEPI: Capturing large genomic contexts for accurately predicting enhancer-promoter interactions

TransEPI is a Transformer-based model for enhancer-promoter interaction (EPI) prediction. This repository contains the code, datasets, and trained models for the paper "Capturing large genomic contexts for accurately predicting enhancer-promoter interactions".


Requirements

  • numpy
  • tqdm
  • scikit-learn
  • PyTorch >= 1.6.0 (1.9.0 or later recommended)
  • pyBigWig (optional, required by prepare_bw_signals.py for preparing features)

Datasets

All the datasets used in this study are available at data/BENGI and data/HiC-loops.

Quickstart

This quickstart covers using the pre-trained models provided in models. To train models on custom datasets, please refer to the "Step-by-step guide" in the next section.

  1. Clone the repository:
git clone git@github.com:biomed-AI/TransEPI.git
  2. Download the processed genomic features
  • Download the genomic features from Synapse:syn26156164
  • Edit the feature configuration file ./data/genomic_data/CTCF_DNase_6histone.500.json to specify the locations of the genomic feature files downloaded from Synapse. Absolute paths are required!
  3. Run the model
cd TransEPI/src
# -t: samples to be predicted
# -c: configuration file
# --gpu: GPU ID (set to -1 to use CPU)
# -m: model file
# -p: prefix of the output files
python ./evaluate_model.py \
    -t ../data/BENGI/HMEC.HiC-Benchmark.v3.tsv.gz \
    -c ../models/TransEPI_EPI.json \
    --gpu 0 \
    -m ../models/TransEPI_EPI_valHMEC.pt \
    -p output

The predictions will be written to output.prediction.txt.
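The feature configuration must use absolute paths. Before running the model, you can scan the downloaded JSON for relative paths with a short script. This is a minimal sketch: `find_relative_paths` is not part of the TransEPI codebase, and the heuristic for spotting path-like strings is an assumption, since the exact layout of the file may vary.

```python
import json
import os

def find_relative_paths(node, prefix=""):
    """Recursively collect string values that look like file paths
    but are not absolute."""
    bad = []
    if isinstance(node, dict):
        for key, value in node.items():
            bad.extend(find_relative_paths(value, f"{prefix}/{key}"))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            bad.extend(find_relative_paths(value, f"{prefix}[{i}]"))
    elif isinstance(node, str) and "/" in node and not os.path.isabs(node):
        bad.append((prefix, node))
    return bad

# Point this at the feature configuration downloaded from Synapse:
# with open("./data/genomic_data/CTCF_DNase_6histone.500.json") as fh:
#     config = json.load(fh)
config = {"CTCF": {"GM12878": "data/GM12878.CTCF.pt"}}  # relative path, gets flagged
for location, path in find_relative_paths(config):
    print(f"not absolute: {location} = {path}")
```

Any path the script reports should be rewritten as an absolute path before running evaluate_model.py.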

Step-by-step guide

Prepare genomic data

For cell types not included in Synapse:syn26156164, prepare the genomic data as follows:

  1. Download the genomic data required by TransEPI from ENCODE or Roadmap

    • CTCF ChIP-seq data in narrowPeak format
    • DNase-seq data in bigWig format
    • H3K27me3, H3K36me3, H3K4me1, H3K4me3, and H3K9me3 ChIP-seq data in bigWig format
  2. Edit TransEPI/data/genomic_data/bed/CTCF_bed.json and TransEPI/data/genomic_data/bigwig/bw_6histone.json to specify the location of the narrowPeak and bigWig files

  3. Convert narrowPeak and bigWig signals to .pt files

cd TransEPI/data/genomic_data
bash ./pipeline.sh
  4. Add the .pt files generated in step 3 to TransEPI/data/genomic_data/CTCF_DNase_6histone.500.json

Summary

  • The locations of the raw narrowPeak and bigWig files should be specified in TransEPI/data/genomic_data/bed/CTCF_bed.json and TransEPI/data/genomic_data/bigwig/bw_6histone.json.
  • The processed data files should be specified in TransEPI/data/genomic_data/CTCF_DNase_6histone.500.json.

Prepare the configuration file for model training

The configuration file should be in .json format:

{
    "data_opts": {	// dataset configuration
        "datasets": [ 	// datasets used to train the model (required by cross_validate.py; ignored by evaluate_model.py)
            "../data/BENGI/GM12878.CTCF-ChIAPET-Benchmark.v3.tsv",
            "../data/BENGI/GM12878.HiC-Benchmark.v3.tsv",
            "../data/BENGI/GM12878.RNAPII-ChIAPET-Benchmark.v3.tsv",
            "../data/BENGI/HeLa.CTCF-ChIAPET-Benchmark.v3.tsv",
            "../data/BENGI/HeLa.HiC-Benchmark.v3.tsv",
            "../data/BENGI/HeLa.RNAPII-ChIAPET-Benchmark.v3.tsv"
        ],
        "feats_order": ["CTCF", "DNase", "H3K4me1", "H3K4me3", "H3K36me3", "H3K9me3", "H3K27me3"], // the order of features (do not change it when using the provided models)
        "feats_config": "../data/genomic_data/CTCF_DNase_6histone.500.json", // location of the genomic data configuration file
        "bin_size": 500,        // bin size used by TransEPI
        "seq_len": 2500000      // the size of the large genomic context (bp)
    },

    "model_opts": {	// EPI model configuration
        "model": "TransEPI",
        "cnn_channels": [180],
        "cnn_sizes": [11],
        "cnn_pool": [10],
        "enc_layers": 3,
        "num_heads": 6,
        "d_inner": 256,
        "da": 64,
        "r": 32,
        "att_C": 0.1,
        "fc": [128, 64],
        "fc_dropout": 0.2
    },

    "train_opts": {	// model training configuration
        "learning_rate": 0.0001,
        "batch_size": 128,
        "num_epoch": 300,
        "patience": 10,
        "num_workers": 16,
        "use_scheduler": false
    }
}

Note: the comments marked with // are only used to illustrate the contents of the configuration file. Remove them from the actual file, as the .json format does not support comments.
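Because standard JSON parsers reject // comments, a finished configuration file can be validated with a short check. This is a sketch: the three section names follow the example above, and `load_config` is not part of the TransEPI codebase.

```python
import json

REQUIRED_SECTIONS = ("data_opts", "model_opts", "train_opts")

def load_config(path):
    """Load a TransEPI-style configuration file and check its top-level
    sections. json.load raises json.JSONDecodeError if // comments remain."""
    with open(path) as fh:
        config = json.load(fh)
    for section in REQUIRED_SECTIONS:
        if section not in config:
            raise KeyError(f"missing section: {section}")
    return config
```

Running this on a file that still contains // comments fails immediately with a JSONDecodeError, which is easier to diagnose than an error deep inside training.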

Prepare input files

The input to the TransEPI model should be formatted like:

1	572380.0	chr5	317258	317610	chr5:317258-317610|GM12878|EH37E0762690	chr5	889314	891314	chr5:889813-889814|GM12878|ENSG00000028310.13|ENST00000388890.4|-
0	100101.0	chr5	317258	317610	chr5:317258-317610|GM12878|EH37E0762690	chr5	216833	218833	chr5:217332-217333|GM12878|ENSG00000164366.3|ENST00000441693.2|-	316258-318610
0	100101.0	chr5	317258	317610	chr5:317258-317610|GM12878|EH37E0762690	chr5	216833	218833	chr5:217332-217333|GM12878|ENSG00000164366.3|ENST00000441693.2|-	316258-318610;416258-418610

The input file should be tab-separated; the fields are:

1. label: for datasets without known labels, set it to 0
2. distance: the distance (in bp) between the enhancer and the promoter
3. e_chr: enhancer chromosome
4. e_start: enhancer start
5. e_end: enhancer end
6. e_name: enhancer name; the cell type must appear as the second `|`-separated field, e.g. chr5:317258-317610|GM12878|EH37E0762690
7. p_chr: promoter chromosome
8. p_start: promoter start
9. p_end: promoter end
10. p_name: promoter name
11. mask region (optional): the feature values in the mask regions will be masked (set to 0); multiple regions are separated by `;`
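The row layout above can be parsed with a few lines of Python. This is a sketch: the field names are taken from the list above, and `parse_epi_record` is not part of the TransEPI codebase.

```python
FIELDS = ["label", "distance", "e_chr", "e_start", "e_end", "e_name",
          "p_chr", "p_start", "p_end", "p_name", "mask_region"]

def parse_epi_record(line):
    """Split one tab-separated input row into a dict; the 11th field
    (mask region) is optional."""
    values = line.rstrip("\n").split("\t")
    if len(values) == 10:
        values.append(None)  # no mask region
    record = dict(zip(FIELDS, values))
    # The cell type is the second '|'-separated field of the enhancer name
    record["cell_type"] = record["e_name"].split("|")[1]
    return record

row = ("1\t572380.0\tchr5\t317258\t317610\t"
       "chr5:317258-317610|GM12878|EH37E0762690\t"
       "chr5\t889314\t891314\t"
       "chr5:889813-889814|GM12878|ENSG00000028310.13|ENST00000388890.4|-")
record = parse_epi_record(row)
print(record["cell_type"])  # GM12878
```

The same parser accepts rows with an 11th mask-region field, as in the second and third example rows above.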

Train the model

  • Cross-validation
# -c: the configuration file prepared in "Prepare the configuration file for model training"
# --gpu: GPU ID (set to -1 to use CPU)
# -o: output directory
python cross_validate.py \
    -c config.json \
    --gpu 0 \
    -o outdir

Models

The trained models are available at models.

Reproducibility

To reproduce the major results shown in the manuscript, see dev/run_cv.sh (cross validation) and dev/run_pred.sh (evaluation).

Experimental feature

Replace src/epi_dataset.py with src/epi_variable_dataset.py to enable TransEPI to support variable-length input.

Questions

For questions about the datasets and code, please contact [email protected] or create an issue.

Citation

Ken Chen, Huiying Zhao, Yuedong Yang, Capturing large genomic contexts for accurately predicting enhancer-promoter interactions, Briefings in Bioinformatics, 2022, bbab577, https://doi.org/10.1093/bib/bbab577


