Giter Site home page Giter Site logo

celfie's Introduction

CelFiE

DOI

Overview

CelFiE (CELl Free dna decomposItion Expectation maximization) is an expectation maximization algorithm that takes as input, a reference panel and cell-free DNA methylation of several individuals. From this data, CelFiE will estimate the contribution of the reference tissues to the cfDNA of each individual, along with an arbitrary number of missing tissues not contained in the reference data.

For more details, please see our paper.

Installation

To install CelFiE, clone or fork this repository:

git clone https://github.com/christacaggiano/celfie.git.

All required packages can be installed using Anaconda using the environment file specified in celfie_conda_env.yml. Run:

conda env create -f celfie_conda_env.yml -n celfie_env

CelFiE was developed in Python 3.7.

TL;DR Running CelFiE

To run CelFiE:

python scripts/celfie.py <input_path> <output_directory> <num_samples> <--max_iterations> <--unknowns> <--parallel_job_id <--convergence> <--random_restarts>

For a detailed description of the parameters, see below. To run a test run of CelFiE with the default parameters run:

python scripts/celfie.py celfie_demo/sample_data.txt celfie_demo/sample_output 15 

CelFiE demo

Sample data is provided in celfie_demo/, along with a sample Jupyter Notebook for analyzing the output demo.ipynb.

Details of Running CelFiE

Preparing Data

CelFiE expects the methylation data to be in the form # of methylated reads, # of total reads. For example it could look like:

CHR   START END METH DEPTH
chr1	10	11	44.0	63.0
chr1	50	51	71.0	133.0
chr1	60	61	89.0	115.0

CelFiE should work, in theory, on Illumina Chip data, if you estimate the read depth of each of the sites. However, we do not officially recommend this.

The input of CelFiE is a single txt file including both the reference data and the cfDNA, with a header indicating sample names (see celfie_demo/sample_data.txt). Essentially the file is the reference and cfDNA sample bed files combined. This data should look something like this:

CHROM START END SAMPLE1_METH SAMPLE1_DEPTH CHROM START END TISSUE1_METH TISSUE1_DEPTH
chr1	10	11	44.0	63.0  chr1	10	11	25.0	29.0
chr1	50	51	71.0	133.0 chr1	50	51	85.0	99.0
chr1	60	61	89.0	115.0 chr1	60	61	92.0	117.0

Code

EM Script

After preparing data as above, you can run EM script as follows:

python scripts/celfie.py <input_path> <output_directory> <num_samples> <--max_iterations> <--unknowns> <--parallel_job_id <--convergence> <--random_restarts>

CelFiE takes several parameters. Input_path, output_directory, and num_samples are the only mandatory parameters.

usage: em.py [-h] [-m MAX_ITERATIONS] [-u UNKNOWNS] [-p PARALLEL_JOB_ID]
             [-c CONVERGENCE] [-r RANDOM_RESTARTS]
             input_path output_directory num_samples

CelFiE - Cell-free DNA decomposition. CelFie estimated the cell type of origin
proportions of a cell-free DNA sample.

positional arguments:
  input_path            The path to the input file
  output_directory      The path to the output directory
  num_samples           Number of cfdna samples

optional arguments:
  -h, --help            show this help message and exit
  -m MAX_ITERATIONS, --max_iterations MAX_ITERATIONS
                        How long the EM should iterate before stopping, unless
                        convergence criteria is met. Default 1000.
  -u UNKNOWNS, --unknowns UNKNOWNS
                        Number of unknown categories to be estimated along
                        with the reference data. Default 1. Can be increased to 2+ for large samples. 
  -p PARALLEL_JOB_ID, --parallel_job_id PARALLEL_JOB_ID
                        Replicate number in a simulation experiment. Default
                        1.
  -c CONVERGENCE, --convergence CONVERGENCE
                        Convergence criteria for EM. Default 0.001.
  -r RANDOM_RESTARTS, --random_restarts RANDOM_RESTARTS
                        CelFiE will perform several random restarts and select
                        the one with the highest log-likelihood. Default 10.

Output

CelFiE will output the tissue estimates for each sample in your input - i.e. the proportion of each tissue in the reference making up the cfDNA sample. See celfie_demo/sample_output/1_tissue_proportions.txt for an example of this output.

        tissue1 tissue2 .... unknown
sample1 0.05 0.08 .... 0.1
sample2 0.7 0.12 .... 0.2

CelFiE also outputs the methylation proportions for each of the tissues plus however many unknowns were estimated. This output will look like this:

      tissue1  tissue2 ... unknown
CpG1  0.99 1.0 ... 0.3
CpG2  0.45 0.88 ... 0.1

Sample code for processing both of these outputs can be seen in demo.ipynb.

L1 projection method

We also developed a method to project estimates onto the L1 ball, based on Duchi et al 2008. The code for this method is available at scripts/projection.py. It can be ran as

python projection.py <output_dir> <replicate> <number of tissues> <number of sites> <number of individuals> <input depth> <reference depth> <tissue_proportions.pkl>

Sample tissue proportions are included at scripts/simulations/unknown_sim_0201_10people.pkl.

Tissue Informative Markers

In our paper, we identified a set of tissue informative markers (TIMs). We claim that these are a good set of CpGs to use for decomposition.

Pre-selected TIMs

TIMs are available at TIMs/sample_tims.txt for individual CpG TIMs, and TIMs/sample_tims_summed.txt for reads summed +/-250bp around a TIM. We recommend using the TIMs/sample_tims_summed.txt for improved decomposition performance.

The TIMs represent markers for the following tissues:

  • dendritic cells
  • endothelial cells
  • eosinophils
  • erythroblasts
  • macrophages
  • monocytes
  • neutrophils
  • placenta
  • T-cells
  • adipose
  • brain
  • fibroblasts
  • heart left ventricle
  • hepatocytes
  • lung
  • mammary gland
  • megakaryocytes
  • skeletal muscle myoblasts
  • small intestine

Data was retrieved from the ENCODE and Blueprint data portals. When available, two biological replicates per tissue were combined into one sample. The TIMs were then calculated on the combined sample.

Please note all data was converted to hg38 and all CpGs are reported as (Chrom, start, end), where the end position indicates the C in the CpG dinucleotide.

Selecting TIMs

Code to find TIMs is located at TIMs/tim.py. This code takes a reference bedfile of all the tissues you would like to calculate TIMs for as input. See TIMs/sample_input.txt.

The TIM code can be run as:

python tim.py <input file> <output file> <num of tim/tissue> <num of tissues> <depth filter> <nan filter>

The number of TIMs per tissue can be adjusted, but note that as the number of TIMs approaches the number of CpGs, the less informative that TIM will be for that tissue.

The depth filter only will consider CpGs that have a median depth across all tissues greater than a user specified value. This is to ensure that low-coverage CpGs do not get selected as TIMs. The NaN filter will only consider CpGs that have less than a user specified number of missing values. This is to ensure a TIM isn't selected for a tissue because it is one of the few tissues with data at that location. The number of tims/tissue can vary. We find that 100 is a good number, and note that as the number of TIMs increase, the lower quality the TIMs will be, since we are selecting the top most informative CpGs/tissue (in other words, the top 100 most informative CpGs for pancreas will by definition, be "better" than the top 500).

For the sample data provided, we suggest:

python tim.py sample_input.txt tim.txt 100 19 15 2

Combining Reads

In our paper, we found that summing all reads +/-250bp offered improved performance when decomposing. To do this for TIMs generated as output of tim.py, we provide a shell script TIMs/tim.sh to call TIMs and sum data.

This script can be updated to change the following parameters:

input_file=sample_input.txt
output_file=sample_tims.txt
summed_file=sample_tims_summed.txt
sum_window_size=500
number_tims=100
number_tissues=19
depth_filter=15
na_filter=2

The pipeline can then be ran as

./tim.sh

Figures

Jupyter notebooks to reproduce figures and statistical analyses for the final version of this manuscript can be found in paper_figures directory.

Acknowledgements

Thanks to Arya Boudaie for help with writing and reviewing this code and to Antoine Passemiers for their tremendous help in speeding up the EM calculation.

Contact

For any questions with this code, please contact [email protected]. I am happy to help and open to any suggestions. If you email me, though, please be nice, I'm trying my best :)

Citation

Christa Caggiano, Barbara Celona, Fleur Garton, Joel Mefford, Brian Black, Catherine Lomen-Hoerth, Andrew Dahl, Noah Zaitlen, "Comprehensive cell type decomposition of circulating cell-free DNA with CelFiE", Nature Communications, May 2021, link

celfie's People

Contributors

christacaggiano avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

celfie's Issues

Original error was: No module named 'numpy.core._multiarray_umath'

Hello,

I'm trying to run CelFiE from the celfie_env conda enviroment

Command:
python /path/celfie/TIMs/tim.py input_path output 100 27 15 2

And I get the following output:

Traceback (most recent call last):
  File "/usr/local/hurcs/miniconda3/envs/python-3.7/lib/python3.7/site-packages/numpy/core/__init__.py", line 22, in <module>
    from . import multiarray
  File "/usr/local/hurcs/miniconda3/envs/python-3.7/lib/python3.7/site-packages/numpy/core/multiarray.py", line 12, in <module>
    from . import overrides
  File "/usr/local/hurcs/miniconda3/envs/python-3.7/lib/python3.7/site-packages/numpy/core/overrides.py", line 7, in <module>
    from numpy.core._multiarray_umath import (
ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cs/icore/ekushele/celfie/TIMs/tim.py", line 2, in <module>
    import numpy as np
  File "/usr/local/hurcs/miniconda3/envs/python-3.7/lib/python3.7/site-packages/numpy/__init__.py", line 150, in <module>
    from . import core
  File "/usr/local/hurcs/miniconda3/envs/python-3.7/lib/python3.7/site-packages/numpy/core/__init__.py", line 48, in <module>
    raise ImportError(msg)
ImportError:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.9 from "/cs/icore/ekushele/miniconda3/envs/celfie_env/bin/python"
  * The NumPy version is: "1.21.3"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'

My conda_env.yml file (I upgraded numpy to 1.21.3):

name: celfie
channels:
  - defaults
dependencies:
  - numpy=1.21.2
  - pandas
  - bottleneck
  - seaborn


A code error in tim.sh

Hi Christa,

Thank you for developing CelFiE.

When I use the tim.sh to generate my own TIMs, I found the tim_summed.txt output file was not for regions but for CpG sites. After debugging, I found there is a code error in this line "awk -v window="$window_size" '{print $1 "\t" $2-$window "\t" $3+$window}' $output_file"sites" > $output_file""$window_size". It should be no $ before window inside awk.

Hope this is helpful.

CelFiE with normal lung tissue samples

Hi Christa,

Thank you for developing CelFiE. I am trying to run CelFiE on 2 samples (adult normal lung tissues-WGBS). After getting the coverage file from Bismark, I ran your prepare_bismark.sh script and created the input to CelFiE with the reference TIMs you provided. I ran CelFiE using the following command for both samples:

python em.py SRR3269863_reference_file_tims.txt 00-deconvolution_with_celfie 1 1000 0 1 0.001 1
python em.py SRR3274240.2_reference_file_tims.txt 00-deconvolution_with_celfie 1 1000 0 1 0.001 100

However, the cell_proportions from CelFiE for both samples indicate >90% placenta which should not be the case.

I am attaching the input files here for each sample as well as the output from CelFiE. I assumed that the output array from the pickle file is ordered in the same way as your reference_file_key.txt.

SRR3269863_reference_file_tims.txt
SRR3274240.2_reference_file_tims.txt
CelFiE_deconvolution_results.xlsx

Any help will be appreciated.

Thanks, Fayaz

celfie returns negative fractions

Dear Christa,
Thank you for developing celfie.
I am trying to apply the analysis to a very controlled dilution experiment we conducted, but I was puzzled to observe that celfie returned often negative fractions in the cell type estimation. Also, different runs lead to very different results (numerical instability?). Is this behaviour expected? I checked the input files and they seem to be fine.

I am deconvolving 20 samples simultaneously using 3 known pure components + 1 unknown, and other much simpler tools based on quadratic programming or regression on beta seem to easily capture the correct fractions up to few % of error.

Best

celfie throws when <num_samples>=2

Hi,

When I usu one sample, celfie runs well:

python celfie.py celfie_samples_combined_ref.txt dir 1

But when I try the same with two samples, it fails:

python celfie.py celfie_samples_combined_ref.txt dir 2

Here is the error message:

Traceback (most recent call last):
  File "celfie.py", line 373, in <module>
    alpha, gamma, ll = em(
  File "celfie.py", line 182, in em
    a, g = maximization(p0, p1, x, x_depths, y, y_depths)
  File "celfie.py", line 131, in maximization
    new_alpha[n, :] = np.dot(p1[:, :, n], x[n, :]) + np.matmul(
  File "<__array_function__ internals>", line 200, in dot
TypeError: can't multiply sequence by non-int of type 'float'

Input is in correct format. Do you have any idea what could go wrong here? Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.