CAESAR

CAESAR (Chromosomal structure And EpigenomicS AnalyzeR) is a deep learning approach to predict nucleosome-resolution 3D chromatin contact maps from existing epigenomic features and lower-resolution Hi-C contact maps.

test testtest

Ref: Feng, F., Yao, Y., Wang, X. Q. D., Zhang, X., & Liu, J. (2020). Connecting high-resolution 3D chromatin organization with epigenomics. bioRxiv.

Required environment

Python-3.6.8
numpy-1.17.4
scipy-1.4.1
tensorflow-2.4.1
matplotlib-3.0.2
pyBigWig-0.3.17
seaborn-0.9.0
pandas-0.23.4

All these dependencies can be installed with "pip install" or "conda install" command within several minutes.

Using CAESAR model

In /Imputation/, we provide some examples of quickly applying the trained model to impute the region that you are interested in.

i. 01_generate_example_input_regions.py

Before this step, users need to get Hi-C files and epigenomic features ready. The example data can be downloaded at DropBox (link: https://www.dropbox.com/sh/wn39g1y4tinwy3z/AADDt7cIANS12izm-Md133LGa?dl=0). Another backup link is: https://www.dropbox.com/sh/wf91igatogz2fte/AACyIPKDU4deTBJMZY2hP-_na?dl=0

This .py file generate numpy matrices from Hi-C contact maps and epigenomic features for the next steps.

The parameters users need to specify (from Line 152 of this file):

impute_cell_name: the cell to impute (must match the file name of bigWig/bedGraph files)
processed_hic_path: the folders containing input Hi-C in .txt format (generated with JuiceTools dump). File names must be chr{?}_1kb.txt.
bigwig_path: the folder containing all bigWig/bedGraph files. File names must be {cell_name}_{epi_name}_hg38.bigWig/bedGraph.
epi_names: ['ATAC_seq', 'CTCF', 'H3K4me1', 'H3K4me3', 'H3K27ac', 'H3K27me3']
ch_coord: The regions you need to impute (must >= 250Kb)
HiC_cell_lines: The Hi-C contact maps to use
temp_folder: temp folder for storage

>>> python 01_generate_example_input_regions.py
...
Step 1: Load HiC data and store into strata
  chr2 HFF
Step 2: Load epigenomic data and store into arrays
  DNase_seq
  CTCF
  H3K4me1
  H3K4me3
  H3K27ac
  H3K27me3

ii. 02_example_region_from_model.py

This .py file generate 250 Kb regions (Hi-C and epi) from the temp outputs of the last step and the use CAESAR model to impute 200-bp-resolution Micro-C contact maps. If the given region is longer than 250 Kb, it will impute multiple contact maps with a 50 Kb step length.

The parameters users need to specify (from Line 211 of this file):

surrogate: whether the input Hi-C is a surrogate Hi-C. If not, it will use the model trained with a matched Hi-C (HFF in this example)

The other parameters have to be consistent with 01_generate_example_input_regions.py.

>>> python 02_example_region_from_model.py
...
Finish generating positional encoding... 0.0min:0.0037298202514648438sec
Finish generating epigenomic features... 0.0min:0.38011598587036133sec
 Loading HiC strata: chr2 23850000 24100000
Finish loading HiC strata... 0.0min:0.41601085662841797sec
 Converting to matrices: chr2
Finish processing 1 HiC maps... 0.0min:0.12031936645507812sec
Start running:
Epoch: 1  Batch: 1
chr2_23850000

iii. 03_visualize_results.py

This .py file visualize all predicted contact maps.

The parameters users need to specify (from Line 137 of this file): All parameters have to be consistent with 01_generate_example_input_regions.py.

>>> python 03_visualize_results.py
...
chr2 23850000

The results are stored in /temp/example_outputs/ as .png files:

(Interpolated) Input HiC in this region:

The predicted region chr2:23,850,000-24,100,000:

Comparing with the ground truth - real Micro-C contact map:

ii. example_region_from_model.py

In this step, the extracted data are fed into the model to generate predicted contact maps. The outputs are visualized in /Model/example_outputs/.

Train CAESAR from scratch

In /Training/, we provide the detailed code of training CAESAR from scratch.

The detailed step-to-step code is in /Training/README.md.

Processing data

Before training, we need to prepare files for

Hi-C contact maps,
Micro-C contact maps
epigenomic features

The required input format for Hi-C and Micro-C contact maps are .txt files generated by dump function of JuiceTools. The resolution of the Micro-C contact map should be 200 bp. Each .txt file corresponds to one chromosome in the format of "position1 - position2 - contacts". E.g.,

10000 10000 3
10000 10200 8
10000 180600 4
...

The required input format for epigenomic tracks are .bigWig or .bedGraph format. Each line in the .bedGraph format should be "chr - start_pos - end_pos - score". E.g.,

chr1    10500   10650   0.157
chr1    10650   10850   0.126
chr1    11050   11100   0.347
...

In /Training/a_data_processing/, we provide the code for loading large contact maps into strata and load epigenomic tracks into numpy arrays. All processed files will take about 5 GB of storage for one cell line.

Model training

CAESAR's inputs include a lower-resolution Hi-C contact map and a number of histone modification features (e.g., H3K4me1, H3K4me3, H3K27ac and H3K27me3), chromatin accessibility (e.g., ATAC-seq), and protein binding profiles (e.g., CTCF). CAESAR captures the Hi-C contact map as a graph G with nodes representing genomic regions of 200 bp long, weighted edges representing chromatin contacts between the regions, and N epigenomic features modeled as N-dimensional node attributes. The architecture of CAESAR includes ordinary 1D convolutional layers which extract local epigenomic patterns along the 1D chromatin fiber, and graph convolutional layers which extract spatial epigenomic patterns over the neighborhood specified by G.

Due to high computational burden, it is impossible to feed the entire contact map into the memory, and therefore we used a 250 kb sliding window with 50 kb step length along the diagonal (e.g., 0-250,000; 50,000-300,000; 100,000-350,000; ...) to select the regions and fed them one by one into the model.

In /Training/b_model_training/, we provide the code for splitting large contact maps into 250 kb regions and training CAESAR model with the processed data. We recommend using GPU to train the model, and it takes about 10 hours.

wu99271053 / caesar Goto Github PK

caesar's Introduction

CAESAR

test testtest

Required environment

Using CAESAR model

i. 01_generate_example_input_regions.py

ii. 02_example_region_from_model.py

iii. 03_visualize_results.py

ii. example_region_from_model.py

Train CAESAR from scratch

Processing data

Model training

caesar's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent