CELFIE (CELl Free dna decomposItion Expectation maximization)

Overview

CelFiE is an expectation-maximization (EM) algorithm that decomposes complex cfDNA mixtures into the tissues that generated the cfDNA fragments. It requires cfDNA methylation data, either WGBS or Illumina 450K with downsampled read counts, and a reference panel of tissues whose contributions to the cfDNA are to be estimated. CelFiE can also estimate an arbitrary number of unknown tissues, meaning tissues that are missing from your reference but truly present in the cfDNA mixtures.
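As a rough illustration of the mixture idea (not CelFiE's actual code): if every tissue has a methylation proportion at each site, the expected cfDNA methylation proportion at a site is the average of the tissue values weighted by each tissue's contribution. The array names and all numbers below are made up for the example.

```python
import numpy as np

# Hypothetical reference: methylation proportion per tissue (rows) and site (columns).
reference_methylation = np.array([
    [0.9, 0.1, 0.5],  # tissue A
    [0.2, 0.8, 0.5],  # tissue B
])

# Hypothetical tissue contributions to one cfDNA sample; they sum to 1.
tissue_proportions = np.array([0.75, 0.25])

# Expected cfDNA methylation at each site is the weighted average over tissues.
expected_cfdna = tissue_proportions @ reference_methylation
print(expected_cfdna)
```

The decomposition problem runs in the opposite direction: given observed cfDNA read counts and a reference panel, estimate the tissue proportions.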

Preparing Data

CelFiE expects the methylation data for each cfDNA individual or reference cell type in the form: number of methylated reads, number of total reads. For example, for one sample, the file could look like:

# CHR START END METH  DEPTH   
chr1	186917271	186917772	446.0	630.0
chr15	92708070	92708571	71.0	133.0
chr14	55296905	55297406	89.0	115.0

If using the provided tissue informative markers (TIMs, in data/reference_file_regions.txt), the chrom, start, and end columns describe a 500bp region around each TIM. The script sum_by_list.py is provided to obtain the summed reads per sample. First, find all CpGs within the specified TIM regions using bedtools. Then sum all reads with sum_by_list.py. If a region has no coverage, meaning that the region is 'missing', sum_by_list.py returns 0 methylated, 0 depth for that region.

bedtools intersect -a <sample file> -b data/reference_file_regions.txt > <output>
python sum_by_list.py data/reference_file_regions.txt <output> <output summed> 1  # run for one sample
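The summing step can be pictured with the minimal sketch below. It is not the code in sum_by_list.py. The function name, the input layout (one tuple per CpG), and the half-open BED interval convention are assumptions made for illustration.

```python
from collections import defaultdict

def sum_reads_by_region(regions, cpg_rows):
    """Sum methylated and total reads over the CpGs that fall in each region.

    regions  : list of (chrom, start, end) tuples
    cpg_rows : iterable of (chrom, position, methylated, depth) tuples
    Regions with no covered CpG get (0.0, 0.0), i.e. they are 'missing'.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for chrom, pos, meth, depth in cpg_rows:
        for region in regions:
            r_chrom, start, end = region
            # Assumed BED-style half-open interval: start <= pos < end.
            if chrom == r_chrom and start <= pos < end:
                totals[region][0] += meth
                totals[region][1] += depth
    return [(region, tuple(totals[region])) for region in regions]

# Made-up regions and CpG counts for illustration.
regions = [("chr1", 100, 600), ("chr2", 50, 550)]
cpgs = [("chr1", 150, 3, 5), ("chr1", 400, 2, 4), ("chr3", 10, 1, 1)]
summed = sum_reads_by_region(regions, cpgs)
for region, (meth, depth) in summed:
    print(*region, meth, depth, sep="\t")
```

The nested loop keeps the sketch short. For genome-scale data, the bedtools intersection shown above is the practical way to assign CpGs to regions.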

All cfDNA and reference data should then be compiled into one tab-separated input file for CelFiE, with one set of bed columns before the sample data and one set before the reference data. For example, the input for 2 individuals and 4 reference cell types would look like:

chr1	186917271	186917772	446.0	630.0	230.0	304.0	chr1	186917271	186917772	1156.062	1196.0	3.968	224.0	1172.14	1234.0	782.018	852.0
chr15	92708070	92708571	71.0	133.0	68.0	94.0	chr1	23291168	23291669	341.02	352.0	2.994	87.0	332.996	359.0	295.99	314.0
chr14	55296905	55297406	89.0	115.0	74.0	83.0	chr1	48159309	48159810	168.987	173.0	1.0	39.0	182.98	196.0	94.00	119.0
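To make the column layout concrete, here is a small sketch that splits one row of the combined file into its cfDNA and reference parts. The helper name split_celfie_row is hypothetical and is not part of the CelFiE code. It assumes every sample and reference contributes a (methylated, depth) pair, as described above.

```python
def split_celfie_row(line, n_samples):
    """Split one combined-input row into regions and (methylated, depth) pairs."""
    fields = line.split()
    sample_region = tuple(fields[:3])
    sample_end = 3 + 2 * n_samples
    sample_values = [float(v) for v in fields[3:sample_end]]
    reference_region = tuple(fields[sample_end:sample_end + 3])
    reference_values = [float(v) for v in fields[sample_end + 3:]]
    # Pair up consecutive (methylated, depth) values.
    samples = list(zip(sample_values[0::2], sample_values[1::2]))
    references = list(zip(reference_values[0::2], reference_values[1::2]))
    return sample_region, samples, reference_region, references

# First row of the example above: 2 cfDNA individuals, 4 reference cell types.
row = ("chr1\t186917271\t186917772\t446.0\t630.0\t230.0\t304.0\t"
       "chr1\t186917271\t186917772\t1156.062\t1196.0\t3.968\t224.0\t"
       "1172.14\t1234.0\t782.018\t852.0")
sample_region, samples, reference_region, references = split_celfie_row(row, 2)
print(samples)
print(references)
```

Splitting on any whitespace rather than strictly on tabs is a deliberate choice here, so the sketch tolerates rows where a space slipped in as a separator.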

To prepare CelFiE files from Bismark output, see prepare_bismark.sh

Code

The Anaconda environment is specified in celfie_conda_env.yml.

The full implementation of the EM model is in EM/em.py. Code to generate simulations can be found in EM/simulations.

Run EM script as follows:

python EM/em.py <input_file> <output_directory> <num of cfDNA samples> <max EM iterations> <num of unknown categories> <parallel job ID> <convergence> <num of random restarts per replicate>

EX: python EM/em.py data/sample_data.txt EM/sample_output 15 1000 1 1 0.001 1
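Because em.py takes eight positional arguments, it is easy to pass them in the wrong order when scripting runs. The sketch below assembles the command from named parameters. The helper build_em_command is hypothetical and only mirrors the argument order documented above.

```python
def build_em_command(input_file, output_dir, n_samples, max_iterations,
                     n_unknowns, job_id, convergence, n_restarts):
    """Return the em.py command line as a list, in the documented argument order."""
    return [
        "python", "EM/em.py",
        input_file,
        output_dir,
        str(n_samples),       # num of cfDNA samples
        str(max_iterations),  # max EM iterations
        str(n_unknowns),      # num of unknown categories
        str(job_id),          # parallel job ID
        str(convergence),     # convergence threshold
        str(n_restarts),      # num of random restarts per replicate
    ]

command = build_em_command("data/sample_data.txt", "EM/sample_output",
                           n_samples=15, max_iterations=1000, n_unknowns=1,
                           job_id=1, convergence=0.001, n_restarts=1)
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) from the repository root should then launch the same run as the example above.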

To run many parallel replicates on an SGE or UGE cluster configuration, see run_real_data.sh:

qsub run_real_data.sh

Currently, the estimated methylation proportions for the reference and the estimated cell type proportions are output as pickled NumPy arrays. These can be read back into Python for further analysis as follows:

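import pickle as pkl
# Load the estimated cell type proportions, then the reference methylation proportions.
with open("EM/sample_output/1_alpha.pkl", "rb") as f:
    cell_proportions = pkl.load(f)
with open("EM/sample_output/1_gamma.pkl", "rb") as f:
    methylation_proportions = pkl.load(f)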
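Once loaded, a quick sanity check might look like the sketch below. It assumes, without confirmation from the CelFiE code, that the cell proportion array is two-dimensional with one row per individual and one column per cell type. Check the array's shape before relying on this. The helper name and the example numbers are made up.

```python
import numpy as np

def summarize_proportions(cell_proportions):
    """Return per-row sums and the index of the largest column in each row.

    Assumes rows are individuals and columns are cell types (an assumption).
    """
    arr = np.asarray(cell_proportions, dtype=float)
    row_sums = arr.sum(axis=1)
    top_columns = arr.argmax(axis=1)
    return row_sums, top_columns

# Synthetic stand-in for a loaded array: 2 individuals, 3 cell types.
example = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])
row_sums, top_columns = summarize_proportions(example)
print(row_sums)
print(top_columns)
```

If the rows are proportions, each row sum should be close to 1. A large deviation would suggest the array is oriented differently than assumed.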

Citation

For more details on CelFiE, see:

Christa Caggiano, Barbara Celona, Fleur Garton, Joel Mefford, Brian Black, Catherine Lomen-Hoerth, Andrew Dahl, Noah Zaitlen, "Estimating the rate of cell type degeneration from epigenetic sequencing of cell-free DNA", BioRxiv, Jan 2020, DOI
