smithlabcode / preseqR
An R package for estimating library complexity
I think this is strongly discouraged, because in some (all??) older versions of R, the underscore character is interpreted as assignment. The typical replacement is to use a dot ('.'). Do you agree @timydaley?
Maybe a stupid idea. Since `preseqR.rSAC.bootstrap` is the bootstrap version of `preseqR.rSAC`, I expected the returned object to behave similarly. But the estimator returned by `preseqR.rSAC` is a function, while the one returned by `preseqR.rSAC.bootstrap` is a list, and you need to use the `f` inside it to take the value for `r`.
This is a very bad name for a file...
When working with paths, for example joining paths, use the `os` module. Like this:
import os
fullPath = os.path.join(srcPath, cppPath)
And make sure not to use the stuff in `shutil` when more standard Python is easily used instead. Also, in this code it is typical to use camel case and to start variables with lowercase.
For replacing the "hpp" with "h" the best way might involve a simple regex replace.
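A minimal sketch of that suggestion, assuming the goal is to build the destination path and then swap the ".hpp" suffix for ".h" (`header_dest` is a made-up helper name, not part of the build script):

```python
import os
import re

def header_dest(srcPath, fileName):
    # join the path components portably, then replace a trailing .hpp with .h
    fullPath = os.path.join(srcPath, fileName)
    return re.sub(r'\.hpp$', '.h', fullPath)
```

Anchoring the pattern with `$` ensures only the suffix is touched, so a name like `hpp_utils.cpp` would pass through unchanged.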
When using any functions associated with the Fisher estimator (`fisher.alpha`, `fisher.rSAC`), I get these warnings:

```
1: In uniroot(function(x) (exp(x) - 1)/x - N/S, c(0.001, ... :
  NA/Inf replaced by maximum positive value
2: In uniroot(function(x) (exp(x) - 1)/x - N/S, c(0.001, ... :
  NA/Inf replaced by maximum positive value
...
```
I think it might be good to add a short tutorial on how to run the example dataset to the README, so that users can easily try out the example dataset after installation.
The *.so file should not be in the repo. What command put it there?
Does ZTNB belong in this directory? Also, typically tar.gz should not be included in the repo.
It seems like everything from preseqR should be included as a submodule of preseq_dev. Does anyone have thoughts on this? @chaodengusc @timydaley
`kmer.frac.curve.bootstrap` and `preseqR.optimal.sequencing` both give the error `Recycling array of length 1 in vector-array arithmetic is deprecated.` However, when I try to convert the value using `as.numeric()` or `as.vector()`, the output changes. Need to investigate more deeply.
It is not clear why so much existing code is duplicated in these files. It does not make sense that the only way to include other code from smithlab_cpp and from preseq is to actually copy it into new files.
The speed of preseqR.continued.fraction.estimate and bootstrap.complexity.curve should no longer be an issue. The correctness of preseqR.continued.fraction.estimate is checked by comparing its results with preseq.
bootstrap.complexity.curve resamples the histogram, runs preseqR.continued.fraction.estimate on each resample, and summarizes all the results to make the estimate. However, the results from this function are quite different from the results in preseq, which is suspicious. @timydaley could you check the code of bootstrap.complexity.curve?
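The resampling scheme described above can be sketched roughly like this (a hypothetical Python illustration, not the package's R implementation; `resample_histogram`, `bootstrap_estimate`, and the median summary are assumptions for the sketch):

```python
import random
from statistics import median

def resample_histogram(hist, rng):
    # hist maps frequency -> number of species observed that many times;
    # resample the observed species with replacement
    freqs = [f for f, count in hist.items() for _ in range(count)]
    out = {}
    for f in (rng.choice(freqs) for _ in freqs):
        out[f] = out.get(f, 0) + 1
    return out

def bootstrap_estimate(hist, estimator, times=100):
    # run the estimator on each resampled histogram, then summarize the results
    estimates = [estimator(resample_histogram(hist, random.Random(i)))
                 for i in range(times)]
    return median(estimates)
```

Whether the actual bootstrap.complexity.curve summarizes with a median, a mean, or a confidence band is something the code itself would have to confirm, and is exactly where its results could diverge from preseq.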
I think some of these files need removal or revision, and there should be no nested preseqR directory inside preseqR.
Given the same histogram and the same parameters for estimating the continued fraction, the C++ function costs 0.27 s of sys time, while the R function costs 4.461 s. The difference is more than 20-fold. It becomes intolerable when I try to use the bootstrap to estimate the complexity curve, as the time consumed is multiplied by the number of resamplings.
Would it be possible to add a tutorial (requested in Issue #21 as well) in the form of the original publication's supplementary data https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4885658/ ? Since there is no usage information and no vignettes, we are stuck with using version 1.2.1. Looking through the changelog it looks like lots of good things have happened since then, and I'd love to be able to use version 4.0 but am completely in the dark about which commands to use...
"Compiled code should not write to stdout or stderr and C++ and Fortran I/O should not be used. As with the previous item such calls may come from external software and may never be called, but package authors are often mistaken about that." -- From CRAN.
If we follow this instruction, I suggest copying continued_fraction.cpp from preseq, removing all I/O functions, and using the file as part of preseqR. But this will generate duplicated code.
Why is it named *.h and not *.hpp?
As I understand it, this function should pick one of ds or ztnb, whichever fits best. The problem is that if it selects ztnb, it does not return the same result as ztnb. There is an obvious bug in the code that gives this effect, see below. Perhaps this is meant to be that way?
The functions differ because of a suspected bad variable assignment:
```r
preseqR.rSAC <- function(n, r=1, mt=20, size=SIZE.INIT, mu=MU.INIT)
{
  para <- preseqR.ztnb.em(n)  ## this call is also different, all params are not sent in
  shape <- para$size  ##### HERE IT STARTS, THIS IS NOT ASSIGNED TO size
  mu <- para$mu
  if (shape <= 1) {
    f.rSAC <- ds.rSAC(n=n, r=r, mt=mt)
  } else {
    ## the population is close to being homogeneous,
    ## so the ZTNB approach is applied
    ## the probability of a species being observed in the initial sample
    p <- 1 - dnbinom(0, size = size, mu = mu)  ######## HERE size is used
    ## L is the estimated number of species in total
    L <- sum(as.numeric(n[, 2])) / p
    ## ZTNB estimator
    f.rSAC <- function(t) {
      L * pnbinom(r - 1, size=size, mu=mu*t, lower.tail=FALSE)  ######## HERE size is used
    }
  }
  return(f.rSAC)
}
```
I tried one extreme case that I think nobody actually needs. When I use `r=500`, the probability I get for 10 is -0.8972191; it should not be negative, right?
In `cs.rSAC`, I get the following warnings:

```
1: In f1 * t/f0 :
  Recycling array of length 1 in vector-array arithmetic is deprecated.
  Use c() or as.vector() instead.
2: In f0 * ppois(r - 1, f1 * t/f0) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.
3: In f0 * ppois(r - 1, f1 * t/f0) * exp(f1/f0) :
  Recycling array of length 1 in vector-array arithmetic is deprecated.
  Use c() or as.vector() instead.
4: In f0 + S - f0 * ppois(r - 1, f1 * t/f0) * exp(f1/f0) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.
```
The instructions you published in your paper are great (https://static-content.springer.com/esm/art%3A10.1007%2Fs40484-015-0049-7/MediaObjects/40484_2015_49_MOESM1_ESM.pdf), but they no longer work with the new version. Do you have an updated vignette? Thank you.
Need to be fixed:
-- Lines longer than 80 chars must be fixed
-- Comments need to start with 3 '#' when not indented (e.g. for function definitions), 2 '#' when indented, and 1 '#' when right-justified.
-- Put a blank line between appropriate code blocks, and put 2 between functions. Any time a comment is put in the code above some source, please include a whitespace line above that comment.
-- Put space around " - " and " + " in mathematical operations, but not around "*" or "/". Use formatting that highlights order of evaluation.
-- Make sure there are no tabs, and every indent is 2 spaces. Also there should be no trailing whitespace.
If at all possible, the conditions for loop exit should be collected into the statement of the loop condition, which serves as the loop invariant. This makes it much easier for a reader to know when and why a loop will either continue or stop. An example is the break statement in the loop at line 630 in miscount.R. The counter and upper.limit should be checked in the while condition along with `bootstrap.times > 0`.
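As a hypothetical illustration of this refactoring (in Python, with made-up names rather than the actual miscount.R variables), every exit test moves into the while header so the loop's continuation condition is visible in one place:

```python
def bootstrap_until(bootstrap_times, upper_limit, try_once):
    # collect results until `bootstrap_times` successes or `upper_limit` attempts;
    # all exit conditions live in the loop header, with no break statements
    counter = 0
    results = []
    while bootstrap_times > 0 and counter < upper_limit:
        counter += 1
        res = try_once()          # one resampling attempt; may fail (None)
        if res is not None:
            results.append(res)
            bootstrap_times -= 1
    return results
```

A reader can now verify termination by inspecting the header alone: each iteration increments counter, and each success decrements bootstrap_times.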
I got a lot of warnings about this when using preseqR.rSAC. My R version is 3.5.1, tested on Linux.
```
> install.packages("/Downloads/preseqR_1.0.1.tar.gz")
Warning message:
package '/Downloads/preseqR_1.0.1.tar.gz' is not available (for R version 3.1.1)
```
Your version of R is up to date
It seems to me that, even though R will now accept "=" as an assignment operator, it might be prudent, and more consistent with most good R code, to use the original "<-" form of assignment. Does anyone have thoughts on this?
I am not sure whether or not we should export the interpolation and extrapolation functions. All these functions predict the number of distinct items, which confuses people when they want to make a prediction. The core function is preseqR.continued.fraction.estimate, which constructs a continued fraction and estimates the number of distinct items by combining results from interpolation and extrapolation. I suggest hiding interpolation and extrapolation from users and exporting only preseqR.continued.fraction.estimate.