tlnagy / crispulator.jl Goto Github PK

View Code? Open in Web Editor NEW

16.0 4.0 6.0 8.62 MB

✂️ Pooled CRISPR screen optimization tool

License: Other

Julia 11.19% Jupyter Notebook 88.81%

functional-genomics julia simulation genomics

crispulator.jl's People

Contributors

Stargazers

Watchers

Forkers

standardgalactic todd-fallesen geng-lee mbutawan3 odow

crispulator.jl's Issues

scan_rep_space analysis is crashing on server

julia -p 62 src/run.jl scan_rep_space.jl data/bigiron_scan_rep_space.csv

Crashes with

Worker 15 terminated.
ERROR (unhandled task failure): EOFError: read end of file

Build out debug vs full runs of experiments

It would be very convenient to run through some toy values for experiments so I can add them to the tests #22 and debug them quicker. This could be activated via a "debug" flag or something.

Run() should be general

Run() as of b14990c is very specific. Adding two function pointers to the parameters would probably be enough to make it completely generic.

Due to some poor designing on my part, the Library type in library.jl is monolithic and fragile. It would be much better to have this as two separate types, Genome and Library. Then this new slimmed down Library type could be integrated with the already existing Cas9behavior type. Separately, num_genes and num_guides should be moved from the Screen types into Genome.

Additionally, instead of the hackish thing I'm doing here: https://github.com/tlnagy/pooled-screen-optimization/blob/41ff08519ad9775733c26552d43e1649d0962008/src/simulation/library.jl#L68-L83 with the combination of dictionaries and categorical distributions, I could use Types to encompass this information and make the whole thing cleaner.

Add noise to growth assay

Growth screens perform too well. Plot screen noise vs optimum number of runs to see how robust the latter value is to the former.

error on `exp` mode

Hi, I just saw an error message once I run the following command:

julia src/run.jl exp facs_binning.jl output.tsv

Here is the error message what I saw:

Loading analysis code...ERROR: LoadError: UndefVarError: output_file not defined
 in bootstrap_exp(::String, ::String, ::Int64, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:31
 in main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:17
 in include_from_node1(::String) at ./loading.jl:488
 in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at ./client.jl:265
 in _start() at ./client.jl:321
 in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34

IMHO, you should change the variable name of output_dir or output_file in bootstrap_exp at src/commands.jl.

Modularize and add integration tests

As it stands, crispulator is very fragile and not very well modularized. Additionally, it doesn't have integration tests and I've accidentally broken the command line interface several times. Modularization is basically a hard requirement for integration tests, that's why I grouped these two together.

These are all things I would like to fix before releasing crispulator into the wild. db90e49 was a first stab at this, but it didn't work because I spend enough time thinking how I'm going to wire everything up.

Split out and modularize plotting code

analysis.jl is getting hairy with all the plotting code in there as well as the actual analysis.

Pseudocounts should be used to handle genes with 0 counts

The new binning method introduced in f25e71e and including genes with 0 counts in 538c329 have an undesired effect of poisoning downstream analysis with Infs everywhere.

Probably the best course of action would be to add pseudocounts to every count similar to what we did during PUBS: https://github.com/tlnagy/seq-analysis/blob/master/seq_analysis.py#L60

Handle growth after transfection

Right now cells are immediately sorted after transfection. Usually they are grown up a bit prior to this happening.

"run.jl config" is not working

Hello @tlnagy, I followed an example of an example command line (from README.md), but it seems that is not working with an error message. Could you help me to figure out how to solve the issue?

julia version

➜  Crispulator.jl git:(master) ✗ julia --version
julia version 0.6.2

installed packages

julia> for (n, v) in Pkg.installed()
           println(string(n, " ",v))
       end
ForwardDiff 0.7.0
AxisAlgorithms 0.2.0
Juno 0.3.2
TextWrap 0.2.0
Nullables 0.0.1
DataStructures 0.7.3
Compat 0.41.0
Calculus 0.2.2
GZip 0.3.0
ShowItLikeYouBuildIt 0.2.0
Measures 0.1.0
StatsFuns 0.5.0
CoupledFields 0.0.1
DataFrames 0.10.1
SpecialFunctions 0.3.6
ArgParse 0.5.0
Showoff 0.1.1
Distributions 0.15.0
FixedPointNumbers 0.4.3
SHA 0.5.2
DualNumbers 0.3.0
Media 0.3.0
Combinatorics 0.5.0
KernelDensity 0.4.0
Iterators 0.3.1
ColorTypes 0.6.6
Polynomials 0.1.6
Gadfly 0.6.4
FileIO 0.6.1
Contour 0.4.0
tmp9Tif7W 0.0.0-
PDMats 0.8.0
Optim 0.11.0
SortingAlgorithms 0.2.0
Hiccup 0.1.1
CommonSubexpressions 0.0.1
WoodburyMatrices 0.2.2
JSON 0.16.3
StatsBase 0.19.4
Hexagons 0.1.0
YAML 0.2.1
Distances 0.5.0
Loess 0.3.0
NLSolversBase 3.1.0
QuadGK 0.2.0
StaticArrays 0.6.6
Crispulator 0.2.0
BinDeps 0.8.2
PositiveFactorizations 0.1.0
Parameters 0.8.1
Reexport 0.0.3
NaNMath 0.3.0
URIParser 0.3.0
HypothesisTests 0.6.0
Rmath 0.3.1
Interpolations 0.7.2
DiffResults 0.0.2
IterTools 0.2.0
LineSearches 3.2.3
Roots 0.4.2
Compose 0.5.4
DataArrays 0.6.2
DiffRules 0.0.2
ColorBrewer 0.3.1
Colors 0.8.2
MacroTools 0.4.0
Ratios 0.2.0
Codecs 0.4.0

Command line

julia src/run.jl config example_config.yml test_output

Error logs

INFO: Using 1 thread(s)
INFO: Loading simulation framework
ERROR: LoadError: MethodError: no method matching parse(::Dict{Any,Any})
The applicable method may be too new: running in world age 21841, while current world is 21983.
Closest candidates are:
  parse(::Dict{Any,Any}) at /Users/hwan/Sandbox/Crispulator.jl/src/utils/parse_yaml.jl:20 (method too new to be called from this world context.)
  parse(!Matched::Type{IPv4}, !Matched::AbstractString) at socket.jl:167
  parse(!Matched::Type{IPv6}, !Matched::AbstractString) at socket.jl:218
  ...
Stacktrace:
 [1] bootstrap_config(::String, ::String, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:79
 [2] main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:24
 [3] include_from_node1(::String) at ./loading.jl:576
 [4] include(::String) at ./sysimg.jl:14
 [5] process_options(::Base.JLOptions) at ./client.jl:305
 [6] _start() at ./client.jl:371
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34

Thank you,

Hyun-Hwan Jeong

Handle growth assays

Separating increasing and decreasing genes yield poor AUPRC/venn scores

Build out testing infrastructure

The simulation is getting more and more complex and it would be nice to have some tests set up to make sure that I don't introduce silent regressions.

Analysis

This issue will track some of the ideas we have for figures/analysis

Generate heatmaps of AUROC as a function of representation and # cells per bin at different noise levels
Generate heatmaps of AUROC with different analysis methods (pvalue vs logfc vs product, etc)
Tradeoff of different binning methods. More cells collected vs less stringency.
tradeoff between number of bottlenecks and strength of bottleneck

Specific pairwise comparisons to explore

CRISPRi vs CRISPR cutting, vary noise, sigmoidal vs linear, growth (+) vs (-)
metric for ranking hits (pvalue, effect size, product), vary noise
FACS: varying binning
Growth: number of rounds
Vary different levels of representation (transfection, bottleneck, sequencing)

Refactor experiments.jl

This file is getting too large with all the different experiments we're setting up. It might be better to separate all these out into separate files (since they are independent) and place them in a separate folder.

Build out documentation

Julia has a real nice documentation package https://github.com/JuliaDocs/Documenter.jl/ which would be nice to leverage. We could use github pages to host and use a UCSF domain for a nicer url. Something like crispulator.ucsf.edu.

Test all the things

Now that the monstrosity that is 1c31347 is merged, I need to do extensive validation that the two different CRISPR libraries are doing what I think they should be doing. This issue will track the necessary tests that need to be built.

test grow() with some fake cells, make sure it is behave properly
test select() for Facs screens
test select() for Growth screens
test auroc functions with fake data
transfection for growth screens
transfection for facs screens
command line behavior

auroc computation checks != negative label, not == positive label

This can lead to erroneous results when a "decreasing" gene is categorized as an "increasing" one and vice versa. This is likely another factor that contributed to the problem observed in #20.

Cleanup at the end of simulations horribly slow

Transposing the data at the end of the simulation (https://github.com/tlnagy/pooled-screen-optimization/blob/7960a2d2e0b7e892ebea598b1916f97115f4a0e3/src/run.jl#L62) is incredibly slow. So much so that is this priority.

Comparing methods code is broken

The new processing abstraction code broken the methods comparison code

Calling scan_best_methods and saving in ../data/output_test.csv
  3.428556 seconds (1.06 M allocations: 44.847 MB, 0.29% gc time)
ERROR: LoadError: ArgumentError: Length of nms doesn't match length of x.
 in names! at /Users/tamasnagy/.julia/v0.4/DataFrames/src/other/index.jl:28
 in scan_best_methods at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/experiments.jl:29
 in main at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl:36
 in include at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in include_from_node1 at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in process_options at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in _start at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
while loading /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl, in expression starting on line 42

Current design crashes with latest stats packages

The excision of Rmath from Base into Rmath.jl has kind of thrown a wrench into my current setup since it causes some precompilation issues: issue here -> JuliaLang/julia#17845.

This is a priority for the slim version of crispulator.

Performance of analysis code has degraded

Looks like I'm being bit by the type inferencing issues with my use of DataFrames in the analysis code. This reduction step is run enough times that this is a real issue. This is also probably causing large allocations, further degrading performance. It looks like the reduction step is almost 50% the time of a run now, which is quite the explosion since I last profiled the codebase.

"-Inf" values of obs_phenotype

To simulate growth screening, I wrote below notebook:

https://gist.github.com/hyunhwaj/b941b84224d22b9b4bc112ee3b904d38

After I ran differences_between_bins, I am able to have bc_counts variable, but all of the values of obs_phenotype are -Inf. I don't see the issue when I simulate FACS screening. I am doubting the -Inf values are correct, and we might have other values which are close to theo_phenotype. Also, it is possible my code on the notebook is wrong. I will appreciate if it is true and you inform me what mistakes were in the code.

Thank you,

Hyun-Hwan Jeong

Fix performance issues

It takes way too long to run one analysis all the way through, about 0.916937 seconds as of 3274b9e.

Profiling could go a long way here. A quick run through seems to point at sequencing being the biggest offender right now with about 0.781344 seconds spent there.

Construct a subset of actual negative control guides

Simply add another class to the inactive, up, and down options.

Linear function isn't linear

Should go from (0,0) to (1, max_phenotype)

Running tests from terminal crashes due to loading full run script

Random bugs in library plotting

~~- [ ] histogram not visible when using log scale on y-axis and colors.~~ this is a Gadfly problem

plotting linear vs sigmoid example is completely broken
plot of guide distribution after library construction is missing

Alternative sigmoid function

The behavior of the current sigmoid function (https://github.com/tlnagy/pooled-screen-optimization/blob/9102f72f38466108ac4a182fa9ed1949f12f5a19/src/library.jl#L5) is not ideal. A 100% KD doesn't always correspond to the maximum phenotype so the tests fail.

Things remaining to do for slim crispulator

From the Google doc:

add generation of volcano and scatter plots
SNR
Venn (confidence interval)
AURPC (confidence interval)

Exporting count matrix

Hello,

I am wondering how I can export the count matrix which is derived from the input parameters. I just only able to have the values of screening performance of the given simulated study as the output of Crispulator. Are you providing any output/function which reveals the count matrix of each sample? If it is not then could you advise me to implement the function?

Thank you,

Guides should store their initial frequencies

Currently, after transfecting cells, the initial frequencies of all the barcodes isn't stored, which makes diagnosing some downstream issues related to #5 harder.

Add additional quality metrics

Evaluating model performance solely based on AUROCs might not be ideal because they aren't well known outside of the biostats/ML community and our simulated data (like real life examples) are skewed towards having more negatives.

add support for simple venn diagram-like method, i.e. % correct of top 25 pos/neg hits
AUPRC, precision-recall might be better than AUROC since the data is skewed towards negatives

Separation based on p-value or abs(mean) unrealistically good

Even with the new method introduced in f25e71e of sampling an "observed phenotype" from a Normal distribution centered at the "theoretical phenotype" with a stddev of 0.5, the separation is unrealistically good. Here I'm sorting guides based on their "observed phenotype" and taking the bottom 1/3 and putting it in one bin and the top 1/3 in another and comparing:

Distribution of "observed phenotypes" that was used to bin

Log-log plot of guide frequencies in the two bins

Volcano plot

It's super apparent when I collapse the results down to the gene-level:

What do you think @martinkampmann? I'm ending up with AUROC's of 0.999.

Refactor into package

Related to #16, it would be much nicer if all the code could be distributed as a Julia package. Not much extra has to be done here and it would avoid jankiness like in 0cd1c39

Allow for multiple doublings before sampling

Currently, the way that bottlenecks work is that cells are grown up till they are larger the bottleneck_representation and subsampled down to this value. Then each step the number of cells is doubled and then half are thrown out. This is repeated num_bottlenecks number of times.

@martinkampmann Would it be interesting to allow cells to double N times and then 1/(2*N) cells are kept after the bottleneck?

Unexpected behavior by decreasing genes

While analyzing the results of the growth screens #11, I noticed some weird behavior by the "decreasing" genes. These are the genes are that are expected to fall out of the experiment as you increase the number of bottlenecks and you can see this decrease with most combinations of growth screen parameters:

but you can also see that the auroc scores then improve as you increase the number of bottlenecks further! The increasing genes behave as expected. This behavior might be a result of a bug in how I compute aurocs. Since I only have summary values for now, it's hard to say what is causing this, but I want to make sure that this is explored fully.

Error tagging new release

The tag name "v0.2" is not of the appropriate SemVer form (vX.Y.Z).
cc: @tlnagy

Clean up terminology and improve consistency

Right now num_cells_per_bin and sequencing depth are inconsistent with the rest of the parameters because they are not defined as being 10x, 100x, etc of the number of guides.

Normalize to median of negative controls

This wasn't a problem with FACS screening, but growth screens require normalization to the median of negative controls.

Maybe I should also consider what Max does and normalize using the stddev of the negative controls too?

Handle CRISPR cutting

Outline

The initial version of the model was built for CRISPRi which behaves pretty consistently in cells. CRISPR cutting, on the other hand, is quite stochastic. Our observed phenotype would be quite bimodal, depending on whether cutting was successful or not.

Test simulation on Windows

Plotting on 32-bit windows is broken

This is due to some hardcoded bits types upstream in Gadfly.jl. I don't know how big of a problem this is, but I'm going to leave this issue here so that people can find it.

Improve terminology

Our terminology could be improved in several areas so I'm triaging the worst offenders:

~~"Increasing" vs "Decreasing" genes~~ (Positive vs Negative phenotype)
"Noise" in Facs Screens
"Bin size" in Facs Screens

Add support for yaml encoded run parameters

Currently, the only way to run a custom experiment is to write a full julia file and put it into the src/exps/ directory. Something like a yaml text file could be much more user friendly. See https://github.com/dcjones/YAML.jl

YAML is nice because unlike other structured text files (e.g. JSON) it is also easy to read and supports comments.

I'm collecting a list of interesting "knobs" that users might want to turn:

Library "knobs"

Screen "knobs"

Growth or Facs
Number of bottlenecks (Growth only)
Bin size (FACS only)
Representation at transfection
Representation at selection
Representation at sequencing
Standard deviation of "noise" (FACS only)

@martinkampmann Thoughts?

auprc implementation is unnecessarily O(n^2)

I'm already looping over all the scores, so doing it twice is wasteful and slow.

auprc computed incorrectly if first label is a true hit

scores = Float64[80,5,157,169,158,166,115,18,6,78,31,5]
classes = [:c,:a,:a,:b,:b,:b,:c,:c,:a,:c,:a,:a]
_auprc, p, r = auprc(scores, classes, Set([:b]))

Add analysis that generates plots for various runs

Just as a sanity check if would be useful to have a test that generates the volcano plots, raw counts plots, and precision-recall plots for a couple different conditions as a sanity check. I haven't created these in a while so this would be nice check before I generate too much data.

tlnagy / crispulator.jl Goto Github PK

crispulator.jl's People

Contributors

Stargazers

Watchers

Forkers

crispulator.jl's Issues

Distribution of "observed phenotypes" that was used to bin

Log-log plot of guide frequencies in the two bins

Volcano plot

Outline

Recommend Projects

Recommend Topics

Recommend Org