
crispulator.jl's Issues

error on `exp` mode

Hi, I saw an error message when I ran the following command:

julia src/run.jl exp facs_binning.jl output.tsv

Here is the error message I saw:

Loading analysis code...ERROR: LoadError: UndefVarError: output_file not defined
 in bootstrap_exp(::String, ::String, ::Int64, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:31
 in main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:17
 in include_from_node1(::String) at ./loading.jl:488
 in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at ./client.jl:265
 in _start() at ./client.jl:321
 in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34

IMHO, you should make the variable names output_dir and output_file consistent in bootstrap_exp in src/commands.jl.
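A minimal sketch of the kind of fix (the signature below is a hypothetical reconstruction; the actual argument list in src/commands.jl may differ):

# Hypothetical reconstruction: the argument is bound as `output_dir`,
# but the body refers to `output_file`, raising the UndefVarError when called.
function bootstrap_exp(analysis_file::String, output_dir::String, nprocs::Int64, quiet::Bool)
    println("Writing results to $output_file")  # UndefVarError: output_file
end

# Fix: use one name consistently in the signature and the body.
function bootstrap_exp(analysis_file::String, output_file::String, nprocs::Int64, quiet::Bool)
    println("Writing results to $output_file")
end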

"run.jl config" is not working

Hello @tlnagy, I followed an example command line from README.md, but it fails with an error message. Could you help me figure out how to solve the issue?

  • julia version
➜  Crispulator.jl git:(master) ✗ julia --version
julia version 0.6.2
  • installed packages
julia> for (n, v) in Pkg.installed()
           println(string(n, " ",v))
       end
ForwardDiff 0.7.0
AxisAlgorithms 0.2.0
Juno 0.3.2
TextWrap 0.2.0
Nullables 0.0.1
DataStructures 0.7.3
Compat 0.41.0
Calculus 0.2.2
GZip 0.3.0
ShowItLikeYouBuildIt 0.2.0
Measures 0.1.0
StatsFuns 0.5.0
CoupledFields 0.0.1
DataFrames 0.10.1
SpecialFunctions 0.3.6
ArgParse 0.5.0
Showoff 0.1.1
Distributions 0.15.0
FixedPointNumbers 0.4.3
SHA 0.5.2
DualNumbers 0.3.0
Media 0.3.0
Combinatorics 0.5.0
KernelDensity 0.4.0
Iterators 0.3.1
ColorTypes 0.6.6
Polynomials 0.1.6
Gadfly 0.6.4
FileIO 0.6.1
Contour 0.4.0
tmp9Tif7W 0.0.0-
PDMats 0.8.0
Optim 0.11.0
SortingAlgorithms 0.2.0
Hiccup 0.1.1
CommonSubexpressions 0.0.1
WoodburyMatrices 0.2.2
JSON 0.16.3
StatsBase 0.19.4
Hexagons 0.1.0
YAML 0.2.1
Distances 0.5.0
Loess 0.3.0
NLSolversBase 3.1.0
QuadGK 0.2.0
StaticArrays 0.6.6
Crispulator 0.2.0
BinDeps 0.8.2
PositiveFactorizations 0.1.0
Parameters 0.8.1
Reexport 0.0.3
NaNMath 0.3.0
URIParser 0.3.0
HypothesisTests 0.6.0
Rmath 0.3.1
Interpolations 0.7.2
DiffResults 0.0.2
IterTools 0.2.0
LineSearches 3.2.3
Roots 0.4.2
Compose 0.5.4
DataArrays 0.6.2
DiffRules 0.0.2
ColorBrewer 0.3.1
Colors 0.8.2
MacroTools 0.4.0
Ratios 0.2.0
Codecs 0.4.0
  • Command line
julia src/run.jl config example_config.yml test_output
  • Error logs
INFO: Using 1 thread(s)
INFO: Loading simulation framework
ERROR: LoadError: MethodError: no method matching parse(::Dict{Any,Any})
The applicable method may be too new: running in world age 21841, while current world is 21983.
Closest candidates are:
  parse(::Dict{Any,Any}) at /Users/hwan/Sandbox/Crispulator.jl/src/utils/parse_yaml.jl:20 (method too new to be called from this world context.)
  parse(!Matched::Type{IPv4}, !Matched::AbstractString) at socket.jl:167
  parse(!Matched::Type{IPv6}, !Matched::AbstractString) at socket.jl:218
  ...
Stacktrace:
 [1] bootstrap_config(::String, ::String, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:79
 [2] main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:24
 [3] include_from_node1(::String) at ./loading.jl:576
 [4] include(::String) at ./sysimg.jl:14
 [5] process_options(::Base.JLOptions) at ./client.jl:305
 [6] _start() at ./client.jl:371
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34

Thank you,

Hyun-Hwan Jeong
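For reference, this looks like Julia's world-age restriction: bootstrap_config apparently includes src/utils/parse_yaml.jl at runtime (defining the parse method the error mentions) and then calls it from the same function. A minimal sketch of the usual workaround, Base.invokelatest (available since Julia 0.6; all names besides invokelatest are hypothetical):

import YAML

function bootstrap_config(config_file::String, output_dir::String, quiet::Bool)
    include("utils/parse_yaml.jl")    # defines parse(::Dict{Any,Any}) at runtime
    config = YAML.load_file(config_file)
    # A direct parse(config) hits the world-age error because the method is
    # newer than the world bootstrap_config was compiled in.
    setup = Base.invokelatest(parse, config)
    return setup
end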

Handle CRISPR cutting

Outline

The initial version of the model was built for CRISPRi which behaves pretty consistently in cells. CRISPR cutting, on the other hand, is quite stochastic. Our observed phenotype would be quite bimodal, depending on whether cutting was successful or not.
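A sketch of one way to model this (hypothetical names, not the package's actual API): draw a Bernoulli per cell for whether the cut produced a functional knockout, which yields the bimodal observed phenotype:

# Stochastic CRISPRn cutting: the phenotype only appears if the cut succeeded.
# knockout_prob and theo_phenotype are illustrative parameters.
function crisprn_phenotype(theo_phenotype::Float64, knockout_prob::Float64)
    if rand() < knockout_prob
        return theo_phenotype  # cut succeeded: full knockout phenotype
    else
        return 0.0             # cut failed or in-frame repair: wild-type
    end
end

phenotypes = [crisprn_phenotype(-0.5, 0.8) for _ in 1:10_000]  # bimodal at 0.0 and -0.5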

Performance of analysis code has degraded

Looks like I'm being bitten by the type-inference issues with my use of DataFrames in the analysis code. This reduction step runs often enough that this is a real issue, and it is probably also causing large allocations, further degrading performance. The reduction step now accounts for almost 50% of a run's time, which is quite the explosion since I last profiled the codebase.

(figure: flame graph)
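A sketch of the standard mitigation, a function barrier (the column name is hypothetical, and older DataFrames versions access columns as df[:counts]): pull the column out of the DataFrame once, then run the hot loop in a function specialized on the concrete vector type.

using DataFrames

# Type-unstable: the element type of a DataFrame column isn't known to the
# compiler inside the loop, causing boxing and allocations.
function reduce_unstable(df::DataFrame)
    total = 0.0
    for x in df[!, :counts]
        total += x
    end
    return total
end

# Function barrier: dispatch once on the concrete vector type, loop inside.
reduce_stable(df::DataFrame) = sum_column(df[!, :counts])

function sum_column(v::AbstractVector)
    total = zero(eltype(v))
    for x in v
        total += x
    end
    return total
end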

Clean up terminology and improve consistency

Right now, num_cells_per_bin and sequencing depth are inconsistent with the rest of the parameters because they are not defined as multiples (10x, 100x, etc.) of the number of guides.

Build out testing infrastructure

The simulation is getting more and more complex and it would be nice to have some tests set up to make sure that I don't introduce silent regressions.

Plotting on 32-bit Windows is broken

This is due to some hardcoded bits types upstream in Gadfly.jl. I don't know how big of a problem this is, but I'm going to leave this issue here so that people can find it.

Improve terminology

Our terminology could be improved in several areas so I'm triaging the worst offenders:

  • "Increasing" vs "Decreasing" genes (Positive vs Negative phenotype)
  • "Noise" in Facs Screens
  • "Bin size" in Facs Screens

Specific pairwise comparisons to explore

  • CRISPRi vs CRISPR cutting, vary noise, sigmoidal vs linear, growth (+) vs (-)
  • metric for ranking hits (pvalue, effect size, product), vary noise
  • FACS: varying binning
  • Growth: number of rounds
  • Vary different levels of representation (transfection, bottleneck, sequencing)

Build out debug vs full runs of experiments

It would be very convenient to run through some toy values for the experiments so I can add them to the tests (#22) and debug them more quickly. This could be activated via a "debug" flag or something; see the sketch below.
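A sketch of what that flag could look like (all names hypothetical):

# Shrink the simulation to toy size when the debug flag is set.
function experiment_size(; debug::Bool = false)
    if debug
        return (num_genes = 50, num_reps = 1)    # fast, for tests and debugging
    else
        return (num_genes = 500, num_reps = 25)  # full-size run
    end
end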

Unexpected behavior by decreasing genes

While analyzing the results of the growth screens (#11), I noticed some weird behavior by the "decreasing" genes. These are the genes that are expected to fall out of the experiment as you increase the number of bottlenecks, and you can see this decrease with most combinations of growth screen parameters:

(figure: growth_screen_perf_param_dependence)

but you can also see that the AUROC scores then improve as you increase the number of bottlenecks further! The increasing genes behave as expected. This behavior might be the result of a bug in how I compute AUROCs. Since I only have summary values for now, it's hard to say what is causing this, but I want to make sure that it is explored fully.

Fix performance issues

It takes way too long to run one analysis all the way through: about 0.916937 seconds as of 3274b9e.

Profiling could go a long way here. A quick run-through seems to point at sequencing as the biggest offender right now, with about 0.781344 seconds spent there.
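For reference, a sketch of the profiling workflow (Profile is a standard library on Julia 1.0+ and lived in Base before that; run_analysis stands in for the actual entry point):

using Profile

run_analysis()           # hypothetical entry point; run once so it compiles
Profile.clear()
@profile run_analysis()  # collect samples on the second, compiled run
Profile.print(format = :flat, sortedby = :count)  # flat view of the hotspots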

Analysis

This issue will track some of the ideas we have for figures/analysis.

  • Generate heatmaps of AUROC as a function of representation and # cells per bin at different noise levels
  • Generate heatmaps of AUROC with different analysis methods (pvalue vs logfc vs product, etc)
  • Tradeoff of different binning methods. More cells collected vs less stringency.
  • tradeoff between number of bottlenecks and strength of bottleneck

Exporting count matrix

Hello,

I am wondering how I can export the count matrix derived from the input parameters. Currently I am only able to get the screening-performance values of the given simulated study as the output of Crispulator. Do you provide any output/function that reveals the count matrix of each sample? If not, could you advise me on how to implement such a function?

Thank you,
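A sketch of one way to dump such a matrix, assuming the per-guide counts end up in a DataFrame (as bc_counts does in the notebook mentioned elsewhere in these issues); CSV.jl is used here, though older DataFrames versions shipped writetable for the same purpose:

using DataFrames, CSV

# Hypothetical count matrix: one row per guide, one column per sample/bin.
bc_counts = DataFrame(guide = ["g1", "g2", "g3"],
                      bin1  = [120, 3, 57],
                      bin2  = [98, 40, 12])
CSV.write("count_matrix.tsv", bc_counts; delim = '\t')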

Add analysis that generates plots for various runs

Just as a sanity check, it would be useful to have a test that generates the volcano plots, raw-count plots, and precision-recall plots for a couple of different conditions. I haven't created these in a while, so this would be a nice check before I generate too much data.

Restructure library.jl

Due to some poor design on my part, the Library type in library.jl is monolithic and fragile. It would be much better to have this as two separate types, Genome and Library. Then this new slimmed-down Library type could be integrated with the already existing Cas9behavior type. Separately, num_genes and num_guides should be moved from the Screen types into Genome.

Additionally, instead of the hackish thing I'm doing here: https://github.com/tlnagy/pooled-screen-optimization/blob/41ff08519ad9775733c26552d43e1649d0962008/src/simulation/library.jl#L68-L83 with the combination of dictionaries and categorical distributions, I could use types to encompass this information and make the whole thing cleaner.

Test all the things

Now that the monstrosity that is 1c31347 is merged, I need to do extensive validation that the two different CRISPR libraries are doing what I think they should be doing. This issue will track the necessary tests that need to be built.

  • test grow() with some fake cells, make sure it behaves properly
  • test select() for Facs screens
  • test select() for Growth screens
  • test auroc functions with fake data
  • transfection for growth screens
  • transfection for facs screens
  • command line behavior

Allow for multiple doublings before sampling

Currently, bottlenecks work like this: cells are grown until there are more of them than bottleneck_representation and are then subsampled down to this value. Then at each step the number of cells is doubled and half are thrown out. This is repeated num_bottlenecks times.

@martinkampmann Would it be interesting to allow cells to double N times and then keep 1/2^N of the cells after the bottleneck? (A sketch follows.)
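A sketch of that proposal (hypothetical; cells are represented here by their genotype indices):

using Random

# Double the population N times, then keep 1/2^N of the cells so the
# population returns to its pre-growth size after the bottleneck.
function bottleneck_with_doublings(cells::Vector{Int}, n_doublings::Int)
    for _ in 1:n_doublings
        cells = vcat(cells, cells)        # every cell divides once
    end
    keep = length(cells) ÷ 2^n_doublings  # back to the starting count
    return shuffle(cells)[1:keep]
end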

Refactor into package

Related to #16, it would be much nicer if all the code could be distributed as a Julia package. Not much extra has to be done here and it would avoid jankiness like in 0cd1c39

Separation based on p-value or abs(mean) unrealistically good

Even with the new method introduced in f25e71e of sampling an "observed phenotype" from a Normal distribution centered at the "theoretical phenotype" with a stddev of 0.5, the separation is unrealistically good. Here I'm sorting guides by their "observed phenotype", putting the bottom 1/3 in one bin and the top 1/3 in another, and comparing the two (a sketch of this binning follows the figures):

Distribution of "observed phenotypes" that was used to bin

(figure: distributions_of_observed_phenotypes)

Log-log plot of guide frequencies in the two bins

(figure: freq_scatter)

Volcano plot

It's super apparent when I collapse the results down to the gene-level:

(figure: volcano_plot_by_class)

What do you think, @martinkampmann? I'm ending up with AUROCs of 0.999.
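A sketch of the binning procedure described above (hypothetical inputs; Distributions.jl supplies the noise):

using Distributions

theo = randn(9000)                           # hypothetical theoretical phenotypes
obs  = theo .+ rand(Normal(0.0, 0.5), 9000)  # observed = theoretical + noise
order = sortperm(obs)
low_bin  = order[1:3000]                     # bottom third of guides
high_bin = order[end-2999:end]               # top third of guides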

Random bugs in library plotting

  • histogram is not visible when using a log scale on the y-axis together with colors; this is a Gadfly problem

  • plotting the linear vs. sigmoid example is completely broken
  • plot of guide distribution after library construction is missing

Modularize and add integration tests

As it stands, crispulator is very fragile and not very well modularized. Additionally, it doesn't have integration tests, and I've accidentally broken the command line interface several times. Modularization is basically a hard requirement for integration tests, which is why I grouped these two together.

These are all things I would like to fix before releasing crispulator into the wild. db90e49 was a first stab at this, but it didn't work because I didn't spend enough time thinking about how I was going to wire everything up.

Add support for yaml encoded run parameters

Currently, the only way to run a custom experiment is to write a full Julia file and put it into the src/exps/ directory. Something like a YAML text file would be much more user friendly. See https://github.com/dcjones/YAML.jl

YAML is nice because, unlike other structured text formats (e.g. JSON), it is also easy to read and supports comments.
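A sketch of the loading side using YAML.jl's load_file (the config keys here are hypothetical):

import YAML

config = YAML.load_file("example_config.yml")  # parses into nested Dicts
num_genes   = config["library"]["num_genes"]   # hypothetical keys
screen_type = config["screen"]["type"]         # e.g. "growth" or "facs"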

I'm collecting a list of interesting "knobs" that users might want to turn:

Library "knobs"

  • number of genes
  • number of guides
  • CRISPRi or CRISPRn
  • Fraction of high-activity guides
  • Mean knockdown for high-activity guides
  • Fraction "increasing" genes
  • Fraction "decreasing" genes

Screen "knobs"

  • Growth or Facs
  • Number of bottlenecks (Growth only)
  • Bin size (FACS only)
  • Representation at transfection
  • Representation at selection
  • Representation at sequencing
  • Standard deviation of "noise" (FACS only)

@martinkampmann Thoughts?

Normalize to median of negative controls

This wasn't a problem with FACS screening, but growth screens require normalization to the median of negative controls.

Maybe I should also consider what Max does and normalize using the stddev of the negative controls too?
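A sketch of both variants (hypothetical variable names; Statistics is in the standard library):

using Statistics

logfc      = randn(1000)        # hypothetical per-guide log fold changes
is_negctrl = rand(1000) .< 0.1  # flags the negative-control guides

negctrl  = logfc[is_negctrl]
centered = logfc .- median(negctrl)                    # median normalization
zscored  = (logfc .- median(negctrl)) ./ std(negctrl)  # also scale by the stddev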

Add noise to growth assay

Growth screens perform too well. Plot screen noise vs optimum number of runs to see how robust the latter value is to the former.

"-Inf" values of obs_phenotype

To simulate growth screening, I wrote the notebook below:

https://gist.github.com/hyunhwaj/b941b84224d22b9b4bc112ee3b904d38

After I ran differences_between_bins, I got the bc_counts variable, but all of the values of obs_phenotype are -Inf. I don't see this issue when I simulate FACS screening. I doubt the -Inf values are correct; we should probably be getting values close to theo_phenotype. It is also possible that my code in the notebook is wrong; if so, I would appreciate it if you could point out my mistakes.

Thank you,

Hyun-Hwan Jeong
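One plausible cause, offered as an assumption rather than a confirmed diagnosis: a phenotype computed as a log of read frequencies becomes -Inf whenever a guide's count is zero, and zero counts are far more common in growth screens after bottlenecking. A sketch of the symptom and the usual pseudocount mitigation:

counts_before = [120, 0, 33]  # hypothetical per-guide read counts
counts_after  = [90, 12, 0]

raw = log2.(counts_after ./ counts_before)  # Inf / -Inf wherever a count is 0

pseudo = 0.5                                # a pseudocount keeps ratios finite
safe = log2.((counts_after .+ pseudo) ./ (counts_before .+ pseudo))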

Refactor experiments.jl

This file is getting too large with all the different experiments we're setting up. It might be better to split these out into separate files (since they are independent) and place them in a separate folder.

Comparing methods code is broken

The new processing abstraction code broke the methods-comparison code:

Calling scan_best_methods and saving in ../data/output_test.csv
  3.428556 seconds (1.06 M allocations: 44.847 MB, 0.29% gc time)
ERROR: LoadError: ArgumentError: Length of nms doesn't match length of x.
 in names! at /Users/tamasnagy/.julia/v0.4/DataFrames/src/other/index.jl:28
 in scan_best_methods at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/experiments.jl:29
 in main at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl:36
 in include at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in include_from_node1 at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in process_options at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
 in _start at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
while loading /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl, in expression starting on line 42

Add additional quality metrics

Evaluating model performance solely on AUROCs might not be ideal: they aren't well known outside the biostats/ML community, and our simulated data (like real-life examples) are skewed towards having more negatives.

  • add support for a simple Venn-diagram-like metric, i.e. % correct of the top 25 pos/neg hits (see the sketch after this list)
  • AUPRC; precision-recall might be better suited than AUROC since the data is skewed towards negatives
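A sketch of the top-25 metric (hypothetical inputs):

# Fraction of the n top-ranked genes that are true hits.
function top_n_precision(scores::Vector{Float64}, is_hit::Vector{Bool}, n::Int = 25)
    top = sortperm(scores, rev = true)[1:n]
    return count(is_hit[top]) / n
end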

Run() should be general

Run() as of b14990c is very specific. Adding two function pointers to the parameters would probably be enough to make it completely generic.
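A sketch of that suggestion (run_screen stands in for the package's run(); all names are hypothetical): pass the screen-specific steps as function arguments so the driver stays generic.

# setup builds the cell population; select applies the screen-specific
# selection step. Both are supplied by the caller.
function run_screen(setup::Function, select::Function, params)
    cells = setup(params)
    return select(cells, params)
end

# Usage with hypothetical FACS-specific steps:
# results = run_screen(facs_setup, facs_select, params)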
