smithlabcode / preseqR
An R package for estimating library complexity
I think this is strongly discouraged, because in some (all??) older versions of R, the underscore character is interpreted as assignment. The typical replacement is to use a dot ('.'). Do you agree @timydaley?
Maybe a stupid idea. Since `preseqR.rSAC.bootstrap` is the bootstrap version of `preseqR.rSAC`, I expected the returned object to behave similarly. But the estimator returned by `preseqR.rSAC` is a function, while the one returned by `preseqR.rSAC.bootstrap` is a list, and you need to use the `f` inside it to take the value for `r`.
This is a very bad name for a file...
When working with paths, for example joining paths, use the `os` module. Like this:
import os
fullPath = os.path.join(srcPath, cppPath)
And make sure not to use the stuff in `shutil` when more standard Python is easily used instead. Also, in this code it is typical to use camel case and to start variables with lowercase.
For replacing the "hpp" with "h" the best way might involve a simple regex replace.
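A minimal sketch of that suggestion, assuming the goal is to build the destination path and then swap the ".hpp" suffix for ".h" (`header_dest` is a made-up helper name, not part of the build script):

```python
import os
import re

def header_dest(srcPath, fileName):
    # join the path components portably, then replace a trailing .hpp with .h
    fullPath = os.path.join(srcPath, fileName)
    return re.sub(r'\.hpp$', '.h', fullPath)
```

Anchoring the pattern with `$` ensures only the suffix is touched, so a name like `hpp_utils.cpp` would pass through unchanged.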
When using any functions associated with the Fisher estimator (`fisher.alpha`, `fisher.rSAC`), I get these warnings:

```
1: In uniroot(function(x) (exp(x) - 1)/x - N/S, c(0.001, ... :
  NA/Inf replaced by maximum positive value
2: In uniroot(function(x) (exp(x) - 1)/x - N/S, c(0.001, ... :
  NA/Inf replaced by maximum positive value
...
```
I think it might be good to add a short tutorial on how to run the example dataset to the README, so that users can easily try out the example dataset after installation.
The *.so file should not be in the repo. What command put it there?
Does ZTNB belong in this directory? Also, typically tar.gz should not be included in the repo.
It seems like everything from preseqR should be included as a submodule of preseq_dev. Does anyone have thoughts on this? @chaodengusc @timydaley
`kmer.frac.curve.bootstrap` and `preseqR.optimal.sequencing` both give the error `Recycling array of length 1 in vector-array arithmetic is deprecated.` However, when I try to convert the value using `as.numeric()` or `as.vector()`, the output changes. Need to investigate more deeply.
It is not clear why so much existing code is duplicated in these files. It does not make sense that the only way to include other code from smithlab_cpp and from preseq is to actually copy it into new files.
The speed of preseqR.continued.fraction.estimate and bootstrap.complexity.curve should no longer be an issue. The correctness of preseqR.continued.fraction.estimate is checked by comparing its results with preseq.
bootstrap.complexity.curve resamples the histogram, runs preseqR.continued.fraction.estimate on each resample, and summarizes all the results to make the estimate. However, the results from this function are quite different from the results in preseq, which is suspicious. @timydaley could you check the code of bootstrap.complexity.curve?
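The resampling scheme described above can be sketched roughly like this (a hypothetical Python illustration, not the package's R implementation; `resample_histogram`, `bootstrap_estimate`, and the median summary are assumptions for the sketch):

```python
import random
from statistics import median

def resample_histogram(hist, rng):
    # hist maps frequency -> number of species observed that many times;
    # resample the observed species with replacement
    freqs = [f for f, count in hist.items() for _ in range(count)]
    out = {}
    for f in (rng.choice(freqs) for _ in freqs):
        out[f] = out.get(f, 0) + 1
    return out

def bootstrap_estimate(hist, estimator, times=100):
    # run the estimator on each resampled histogram, then summarize the results
    estimates = [estimator(resample_histogram(hist, random.Random(i)))
                 for i in range(times)]
    return median(estimates)
```

Whether the actual bootstrap.complexity.curve summarizes with a median, a mean, or a confidence band is something the code itself would have to confirm, and is exactly where its results could diverge from preseq.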
I think some of these files need removal or revision, and there should be no nested preseqR directory inside preseqR.
Given the same histogram and the same parameters for estimating the continued fraction, the C++ function costs 0.27 s of sys time, while the R function costs 4.461 s. The difference is more than 20-fold. It becomes intolerable when I try to use the bootstrap to estimate the complexity curve, as the time consumed is multiplied by the number of resamplings.
Would it be possible to add a tutorial (requested in Issue #21 as well) in the form of the original publication's supplementary data https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885658/ ? Since there is no usage information and no vignettes, we are stuck with using version 1.2.1. Looking through the changelog it looks like lots of good things have happened since then, and I'd love to be able to use version 4.0 but am completely in the dark about which commands to use...
"Compiled code should not write to stdout or stderr and C++ and Fortran I/O should not be used. As with the previous item such calls may come from external software and may never be called, but package authors are often mistaken about that." -- From CRAN.
If we follow this instruction, I suggest copying continued_fraction.cpp from preseq, removing all I/O functions, and using the file as part of preseqR. But this will generate duplicated code.
Why is it named *.h and not *.hpp?
As I understand it, this function should pick one of ds or ztnb, whichever fits best. The problem is that if it selects ztnb, it does not return the same result as ztnb. There is an obvious bug in the code that gives this effect, see below. Perhaps this is meant to be that way?
The functions differ because of a suspected bad variable assignment:
```r
preseqR.rSAC <- function(n, r=1, mt=20, size=SIZE.INIT, mu=MU.INIT)
{
  para <- preseqR.ztnb.em(n)  ## this call is also different, all params are not sent in
  shape <- para$size  ##### HERE IT STARTS, THIS IS NOT ASSIGNED TO size
  mu <- para$mu
  if (shape <= 1) {
    f.rSAC <- ds.rSAC(n=n, r=r, mt=mt)
  } else {
    ## the population is close to being homogeneous,
    ## so the ZTNB approach is applied
    ## the probability of a species being observed in the initial sample
    p <- 1 - dnbinom(0, size = size, mu = mu)  ######## HERE size is used
    ## L is the estimated number of species in total
    L <- sum(as.numeric(n[, 2])) / p
    ## ZTNB estimator
    f.rSAC <- function(t) {
      L * pnbinom(r - 1, size=size, mu=mu*t, lower.tail=FALSE)  ######## HERE size is used
    }
  }
  return(f.rSAC)
}
```
I tried one extreme case that I think nobody actually needs. When I use `r=500`, the probability I get for 10 is -0.8972191; it should not be negative, right?
In `cs.rSAC`, I get the following warnings:

```
1: In f1 * t/f0 :
  Recycling array of length 1 in vector-array arithmetic is deprecated.
  Use c() or as.vector() instead.
2: In f0 * ppois(r - 1, f1 * t/f0) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.
3: In f0 * ppois(r - 1, f1 * t/f0) * exp(f1/f0) :
  Recycling array of length 1 in vector-array arithmetic is deprecated.
  Use c() or as.vector() instead.
4: In f0 + S - f0 * ppois(r - 1, f1 * t/f0) * exp(f1/f0) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.
```
The instructions you published in your paper are great (https://static-content.springer.com/esm/art%3A10.1007%2Fs40484-015-0049-7/MediaObjects/40484_2015_49_MOESM1_ESM.pdf), but they no longer work with the new version. Do you have an updated vignette? Thank you.
Need to be fixed:
-- Lines longer than 80 chars must be fixed
-- Comments need to start with 3 '#' when not indented (e.g. for function definitions), 2 '#' when indented, and 1 '#' when right-justified.
-- Put a blank line between appropriate code blocks, and put 2 between functions. Any time a comment is put in the code above some source, please include a whitespace line above that comment.
-- Put space around " - " and " + " in mathematical operations, but not around "*" or "/". Use formatting that highlights order of evaluation.
-- Make sure there are no tabs, and every indent is 2 spaces. Also there should be no trailing whitespace.
If at all possible, the conditions for loop exit should be collected into the statement of the loop condition, which serves as the loop invariant. This makes it much easier for a reader to know when and why a loop will either continue or stop. An example is the break statement in the loop at line 630 in miscount.R. The counter and upper.limit should be checked in the while condition along with `bootstrap.times > 0`.
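As a hypothetical illustration of this refactoring (in Python, with made-up names rather than the actual miscount.R variables), every exit test moves into the while header so the loop's continuation condition is visible in one place:

```python
def bootstrap_until(bootstrap_times, upper_limit, try_once):
    # collect results until `bootstrap_times` successes or `upper_limit` attempts;
    # all exit conditions live in the loop header, with no break statements
    counter = 0
    results = []
    while bootstrap_times > 0 and counter < upper_limit:
        counter += 1
        res = try_once()          # one resampling attempt; may fail (None)
        if res is not None:
            results.append(res)
            bootstrap_times -= 1
    return results
```

A reader can now verify termination by inspecting the header alone: each iteration increments counter, and each success decrements bootstrap_times.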
I got a lot of warnings about this when using preseqR.rSAC. My R version is 3.5.1, tested on Linux.
```
> install.packages("/Downloads/preseqR_1.0.1.tar.gz")
Warning message:
package '/Downloads/preseqR_1.0.1.tar.gz' is not available (for R version 3.1.1)
```
Your version of R is up to date
It seems to me that, even though R will now accept "=" as an assignment operator, it might be prudent, and more consistent with most good R code, to use the original "<-" form of assignment. Does anyone have thoughts on this?
I am not sure whether or not we should export the interpolation and extrapolation functions. All these functions predict the number of distinct items, which confuses people when they want to make a prediction. The core function is preseqR.continued.fraction.estimate, which constructs a continued fraction and estimates the number of distinct items by combining results from interpolation and extrapolation. I suggest hiding interpolation and extrapolation from users and exporting only preseqR.continued.fraction.estimate.