Giter Site home page Giter Site logo

coda's Introduction

Coda: a convolutional denoising algorithm for genome-wide ChIP-seq data

Coda uses convolutional neural networks to learn a mapping from noisy to high-quality ChIP-seq data. These trained networks can then be used to remove noise and improve the quality of new ChIP-seq data. For more details, please refer to our paper

Koh PW, Pierson E, Kundaje A, Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics (2017) 33 (14): i225-i233 URL:https://doi.org/10.1093/bioinformatics/btx243 (ISMB 2017 Proceedings)

bioRxiv doi: https://doi.org/10.1101/052118

Dependencies

The code is written in Python 2.7 and requires the following Python packages to run:

  • Numpy (1.11.1)
  • Scipy (0.18.0)
  • Scikit-learn (0.17.1)
  • Pandas (0.18.1)
  • h5py (2.6.0)
  • rpy2 (2.8.1)
  • Keras (1.0.7)

In addition, if you want to process your own data, you will need:

  • AQUAS ChIP-seq pipeline
  • SAMtools (1.2)
  • BEDtools (2.23)
  • ucsc_tools (3.0.9)

Training and testing a model with pre-processed data

The fastest way to get started is to download data that has already been pre-processed. We have uploaded processed ChIP-seq data from lymphoblastoid cell lines GM12878 and GM18526, taken from [1]. Each cell line has two sets of ChIP-seq data, one derived from 1M reads per mark and the other from 100M+ reads per mark. The instructions below will train a model to recover high-depth data from low-depth data on GM12878, and then apply it to low-depth data on GM18526, evaluating the model output against high-depth data on GM18526:

  1. Clone the repo and install the dependencies above.

  2. Edit diConstants.py to reflect the paths where you want to store the data, code, results, etc.

  3. Run setup.py. This runs a few test imports to make sure you have the required libraries, and sets up the directory structure as specified in diConstants.py.

  4. Run copyData.py. This copies the required data (including hg19 blacklist and chromosome sizes) to the appropriate folders. Note that the data is 6GB in size, so please run this script in a location where there's enough space!

  5. Finally, run python runGMExperiments.py to get the experiments going. Numerical results will be written to RESULTS_ROOT. Output tracks (reconstructed signal and peak calls) will be written to RESULTS_BIGWIG_ROOT. We make use of the R 'PRROC' package, written by Jan Grau and Jens Keilwagen, to evaluate peak calls.

Processing your own data

We use the AQUAS ChIP-seq pipeline (https://github.com/kundajelab/TF_chipseq_pipeline) to process raw ChIP-seq data. The script prepData.py (and the contents of the scripts folder) contains wrapper functions that call the AQUAS pipeline for you.

Please install the AQUAS pipeline before proceeding. Note that this pipeline is still under some development and might be changing in non-backwards-compatible ways. Our code has been tested with commit 7b7dd27d42d46ac52f5687f80904c576d1b6595d of the AQUAS pipeline.

To create the processed data that we provided above, you may run the following steps:

  1. Follow steps 1-3 of the above section.

  2. Download the files corresponding to GM12878 and GM18526: http://gbsc-share.stanford.edu/chromovar/rawdata/mapped/bam/personal/reconcile/dedup/

  3. Run python prepData.py make_intervals hg19. You only need to do this once.

  4. Run python prepData.py run_GM_pipeline.

This code assumes that you've downloaded the files to a shared location (REMOTE_ROOT, specified in diConstants.py). It makes copies of the files in a local directory, RAW_ROOT, before proceeding. This setup is useful if REMOTE_ROOT is shared across multiple machines and RAW_ROOT is local to the machine that you're running the code on, because there will be a lot of IO operations that will be faster if done locally. If you do not need this, modify merge_BAMs() in prepData.py to remove the copying.

To process your own data, simply modify the paths in diConstants.py or copy your data to the right directories. While we start from BAM files in this example, the AQUAS pipeline can start from a variety of input files (e.g., FASTQ, tagAligns). Edit scripts/getSignalTrack.sh and scripts/findPeaks.sh if you want to change the parameters that are passed into AQUAS.

Contact

If you have any questions, please contact:

References

[1] Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu Y, et al. Extensive variation in chromatin states across humans. Science (New York, NY). 2013 11;342(6159):750โ€“2

coda's People

Contributors

akundaje avatar kohpangwei avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.