tlnagy / crispulator.jl Goto Github PK
View Code? Open in Web Editor NEW✂️ Pooled CRISPR screen optimization tool
License: Other
✂️ Pooled CRISPR screen optimization tool
License: Other
Hi, I just saw an error message once I run the following command:
julia src/run.jl exp facs_binning.jl output.tsv
Here is the error message what I saw:
Loading analysis code...ERROR: LoadError: UndefVarError: output_file not defined
in bootstrap_exp(::String, ::String, ::Int64, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:31
in main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:17
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:265
in _start() at ./client.jl:321
in _start() at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34
IMHO, you should change the variable name of output_dir
or output_file
in bootstrap_exp
at src/commands.jl
.
Hello @tlnagy, I followed an example of an example command line (from README.md
), but it seems that is not working with an error message. Could you help me to figure out how to solve the issue?
julia
version➜ Crispulator.jl git:(master) ✗ julia --version
julia version 0.6.2
julia> for (n, v) in Pkg.installed()
println(string(n, " ",v))
end
ForwardDiff 0.7.0
AxisAlgorithms 0.2.0
Juno 0.3.2
TextWrap 0.2.0
Nullables 0.0.1
DataStructures 0.7.3
Compat 0.41.0
Calculus 0.2.2
GZip 0.3.0
ShowItLikeYouBuildIt 0.2.0
Measures 0.1.0
StatsFuns 0.5.0
CoupledFields 0.0.1
DataFrames 0.10.1
SpecialFunctions 0.3.6
ArgParse 0.5.0
Showoff 0.1.1
Distributions 0.15.0
FixedPointNumbers 0.4.3
SHA 0.5.2
DualNumbers 0.3.0
Media 0.3.0
Combinatorics 0.5.0
KernelDensity 0.4.0
Iterators 0.3.1
ColorTypes 0.6.6
Polynomials 0.1.6
Gadfly 0.6.4
FileIO 0.6.1
Contour 0.4.0
tmp9Tif7W 0.0.0-
PDMats 0.8.0
Optim 0.11.0
SortingAlgorithms 0.2.0
Hiccup 0.1.1
CommonSubexpressions 0.0.1
WoodburyMatrices 0.2.2
JSON 0.16.3
StatsBase 0.19.4
Hexagons 0.1.0
YAML 0.2.1
Distances 0.5.0
Loess 0.3.0
NLSolversBase 3.1.0
QuadGK 0.2.0
StaticArrays 0.6.6
Crispulator 0.2.0
BinDeps 0.8.2
PositiveFactorizations 0.1.0
Parameters 0.8.1
Reexport 0.0.3
NaNMath 0.3.0
URIParser 0.3.0
HypothesisTests 0.6.0
Rmath 0.3.1
Interpolations 0.7.2
DiffResults 0.0.2
IterTools 0.2.0
LineSearches 3.2.3
Roots 0.4.2
Compose 0.5.4
DataArrays 0.6.2
DiffRules 0.0.2
ColorBrewer 0.3.1
Colors 0.8.2
MacroTools 0.4.0
Ratios 0.2.0
Codecs 0.4.0
julia src/run.jl config example_config.yml test_output
INFO: Using 1 thread(s)
INFO: Loading simulation framework
ERROR: LoadError: MethodError: no method matching parse(::Dict{Any,Any})
The applicable method may be too new: running in world age 21841, while current world is 21983.
Closest candidates are:
parse(::Dict{Any,Any}) at /Users/hwan/Sandbox/Crispulator.jl/src/utils/parse_yaml.jl:20 (method too new to be called from this world context.)
parse(!Matched::Type{IPv4}, !Matched::AbstractString) at socket.jl:167
parse(!Matched::Type{IPv6}, !Matched::AbstractString) at socket.jl:218
...
Stacktrace:
[1] bootstrap_config(::String, ::String, ::Bool) at /Users/hwan/Sandbox/Crispulator.jl/src/commands.jl:79
[2] main() at /Users/hwan/Sandbox/Crispulator.jl/src/run.jl:24
[3] include_from_node1(::String) at ./loading.jl:576
[4] include(::String) at ./sysimg.jl:14
[5] process_options(::Base.JLOptions) at ./client.jl:305
[6] _start() at ./client.jl:371
while loading /Users/hwan/Sandbox/Crispulator.jl/src/run.jl, in expression starting on line 34
Thank you,
Hyun-Hwan Jeong
Should go from (0,0) to (1, max_phenotype)
The initial version of the model was built for CRISPRi which behaves pretty consistently in cells. CRISPR cutting, on the other hand, is quite stochastic. Our observed phenotype would be quite bimodal, depending on whether cutting was successful or not.
Looks like I'm being bit by the type inferencing issues with my use of DataFrames in the analysis code. This reduction step is run enough times that this is a real issue. This is also probably causing large allocations, further degrading performance. It looks like the reduction step is almost 50% the time of a run now, which is quite the explosion since I last profiled the codebase.
Right now num_cells_per_bin
and sequencing depth
are inconsistent with the rest of the parameters because they are not defined as being 10x, 100x, etc of the number of guides.
The behavior of the current sigmoid function (https://github.com/tlnagy/pooled-screen-optimization/blob/9102f72f38466108ac4a182fa9ed1949f12f5a19/src/library.jl#L5) is not ideal. A 100% KD doesn't always correspond to the maximum phenotype so the tests fail.
The simulation is getting more and more complex and it would be nice to have some tests set up to make sure that I don't introduce silent regressions.
This is due to some hardcoded bits types upstream in Gadfly.jl. I don't know how big of a problem this is, but I'm going to leave this issue here so that people can find it.
Our terminology could be improved in several areas so I'm triaging the worst offenders:
It would be very convenient to run through some toy values for experiments so I can add them to the tests #22 and debug them quicker. This could be activated via a "debug" flag or something.
Simply add another class
to the inactive, up, and down options.
This can lead to erroneous results when a "decreasing" gene is categorized as an "increasing" one and vice versa. This is likely another factor that contributed to the problem observed in #20.
Right now cells are immediately sorted after transfection. Usually they are grown up a bit prior to this happening.
While analyzing the results of the growth screens #11, I noticed some weird behavior by the "decreasing" genes. These are the genes are that are expected to fall out of the experiment as you increase the number of bottlenecks and you can see this decrease with most combinations of growth screen parameters:
but you can also see that the auroc scores then improve as you increase the number of bottlenecks further! The increasing genes behave as expected. This behavior might be a result of a bug in how I compute aurocs. Since I only have summary values for now, it's hard to say what is causing this, but I want to make sure that this is explored fully.
analysis.jl
is getting hairy with all the plotting code in there as well as the actual analysis.
Julia has a real nice documentation package https://github.com/JuliaDocs/Documenter.jl/ which would be nice to leverage. We could use github pages to host and use a UCSF domain for a nicer url. Something like crispulator.ucsf.edu
.
Transposing the data at the end of the simulation (https://github.com/tlnagy/pooled-screen-optimization/blob/7960a2d2e0b7e892ebea598b1916f97115f4a0e3/src/run.jl#L62) is incredibly slow. So much so that is this priority.
It takes way too long to run one analysis all the way through, about 0.916937 seconds as of 3274b9e.
Profiling could go a long way here. A quick run through seems to point at sequencing being the biggest offender right now with about 0.781344 seconds spent there.
This issue will track some of the ideas we have for figures/analysis
Hello,
I am wondering how I can export the count matrix which is derived from the input parameters. I just only able to have the values of screening performance of the given simulated study as the output of Crispulator. Are you providing any output/function which reveals the count matrix of each sample? If it is not then could you advise me to implement the function?
Thank you,
Just as a sanity check if would be useful to have a test that generates the volcano plots, raw counts plots, and precision-recall plots for a couple different conditions as a sanity check. I haven't created these in a while so this would be nice check before I generate too much data.
Due to some poor designing on my part, the Library type in library.jl
is monolithic and fragile. It would be much better to have this as two separate types, Genome and Library. Then this new slimmed down Library type could be integrated with the already existing Cas9behavior type. Separately, num_genes
and num_guides
should be moved from the Screen types into Genome.
Additionally, instead of the hackish thing I'm doing here: https://github.com/tlnagy/pooled-screen-optimization/blob/41ff08519ad9775733c26552d43e1649d0962008/src/simulation/library.jl#L68-L83 with the combination of dictionaries and categorical distributions, I could use Types to encompass this information and make the whole thing cleaner.
Now that the monstrosity that is 1c31347 is merged, I need to do extensive validation that the two different CRISPR libraries are doing what I think they should be doing. This issue will track the necessary tests that need to be built.
grow()
with some fake cells, make sure it is behave properlyselect()
for Facs screensselect()
for Growth screensauroc
functions with fake dataCurrently, after transfecting cells, the initial frequencies of all the barcodes isn't stored, which makes diagnosing some downstream issues related to #5 harder.
Currently, the way that bottlenecks work is that cells are grown up till they are larger the bottleneck_representation
and subsampled down to this value. Then each step the number of cells is doubled and then half are thrown out. This is repeated num_bottlenecks
number of times.
@martinkampmann Would it be interesting to allow cells to double N
times and then 1/(2*N)
cells are kept after the bottleneck?
From the Google doc:
julia -p 62 src/run.jl scan_rep_space.jl data/bigiron_scan_rep_space.csv
Crashes with
Worker 15 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Even with the new method introduced in f25e71e of sampling an "observed phenotype" from a Normal distribution centered at the "theoretical phenotype" with a stddev of 0.5, the separation is unrealistically good. Here I'm sorting guides based on their "observed phenotype" and taking the bottom 1/3 and putting it in one bin and the top 1/3 in another and comparing:
It's super apparent when I collapse the results down to the gene-level:
What do you think @martinkampmann? I'm ending up with AUROC's of 0.999.
- [ ] histogram not visible when using log scale on y-axis and colors. this is a Gadfly problem
As it stands, crispulator
is very fragile and not very well modularized. Additionally, it doesn't have integration tests and I've accidentally broken the command line interface several times. Modularization is basically a hard requirement for integration tests, that's why I grouped these two together.
These are all things I would like to fix before releasing crispulator
into the wild. db90e49 was a first stab at this, but it didn't work because I spend enough time thinking how I'm going to wire everything up.
Currently, the only way to run a custom experiment is to write a full julia file and put it into the src/exps/
directory. Something like a yaml text file could be much more user friendly. See https://github.com/dcjones/YAML.jl
YAML is nice because unlike other structured text files (e.g. JSON) it is also easy to read and supports comments.
I'm collecting a list of interesting "knobs" that users might want to turn:
Library "knobs"
Screen "knobs"
@martinkampmann Thoughts?
This wasn't a problem with FACS screening, but growth screens require normalization to the median of negative controls.
Maybe I should also consider what Max does and normalize using the stddev of the negative controls too?
Growth screens perform too well. Plot screen noise vs optimum number of runs to see how robust the latter value is to the former.
To simulate growth screening, I wrote below notebook:
https://gist.github.com/hyunhwaj/b941b84224d22b9b4bc112ee3b904d38
After I ran differences_between_bins
, I am able to have bc_counts
variable, but all of the values of obs_phenotype
are -Inf
. I don't see the issue when I simulate FACS screening. I am doubting the -Inf
values are correct, and we might have other values which are close to theo_phenotype
. Also, it is possible my code on the notebook is wrong. I will appreciate if it is true and you inform me what mistakes were in the code.
Thank you,
Hyun-Hwan Jeong
The tag name "v0.2" is not of the appropriate SemVer form (vX.Y.Z).
cc: @tlnagy
The new binning method introduced in f25e71e and including genes with 0 counts in 538c329 have an undesired effect of poisoning downstream analysis with Inf
s everywhere.
Probably the best course of action would be to add pseudocounts to every count similar to what we did during PUBS: https://github.com/tlnagy/seq-analysis/blob/master/seq_analysis.py#L60
This file is getting too large with all the different experiments we're setting up. It might be better to separate all these out into separate files (since they are independent) and place them in a separate folder.
The new processing abstraction code broken the methods comparison code
Calling scan_best_methods and saving in ../data/output_test.csv
3.428556 seconds (1.06 M allocations: 44.847 MB, 0.29% gc time)
ERROR: LoadError: ArgumentError: Length of nms doesn't match length of x.
in names! at /Users/tamasnagy/.julia/v0.4/DataFrames/src/other/index.jl:28
in scan_best_methods at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/experiments.jl:29
in main at /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl:36
in include at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
in include_from_node1 at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
in process_options at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
in _start at /usr/local/Cellar/julia/0.4.5/lib/julia/sys.dylib
while loading /Users/tamasnagy/Documents/Classes/kampmann_rotation/pooled-screen-optimization/src/run.jl, in expression starting on line 42
The excision of Rmath from Base into Rmath.jl
has kind of thrown a wrench into my current setup since it causes some precompilation issues: issue here -> JuliaLang/julia#17845.
This is a priority for the slim version of crispulator.
I'm already looping over all the scores, so doing it twice is wasteful and slow.
Evaluating model performance solely based on AUROCs might not be ideal because they aren't well known outside of the biostats/ML community and our simulated data (like real life examples) are skewed towards having more negatives.
Run() as of b14990c is very specific. Adding two function pointers to the parameters would probably be enough to make it completely generic.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.