pneuvial / c3co Goto Github PK

View Code? Open in Web Editor NEW

5.0 5.0 1.0 20.47 MB

Inferring cancer cell clonality from copy-number data

R 100.00%

c3co's People

Contributors

Stargazers

Watchers

Forkers

mpierrejean

c3co's Issues

TASK: Help pages for `get.W()` and `get.Z()`

Write help pages for non-exported functions:

get.W()
get.Z()

TESTS: "repeatability" tests

The package could need so "repeatability" tests, that is, tests that make sure results don't change between updates/versions as long as the input and methods are the same. The current set of tests are focused on making sure output is of proper data type and dimensions.

This will make it safe to optimize code, introduce parallelism (#37) etc. Right now we don't know if a small tweak breaks the results.

Wrapper for facets data

facets (https://github.com/mskcc/facets) seems to be an interesting method for producing PSCN estimates from WGS or WXS data. It would be nice to have a wrapper so that their PSCN estimates can be used by the package.

There is an example file in the facets package (see the package vignette)

datafile = system.file("extdata", "stomach.csv.gz", package="facets"); read.csv(datafile, nr=5)

CODE: Never test against class(x)

FYI, it's not safe to test against class(x), e.g.

    if (!is.null(segDat)) {
      if(class(segDat)=="character") {  ## <<==
        if (verbose) {
            message("Reading segmentation results from file: ")
            mprint(segDat)
        }
        seg <- readRDS(segDat)

Instead, when testing for class "membership", use inherits(x, class), e.g.

      if(inherits(segDat, "character")) {

Now, in this case one can equally well, and it's "cleaner", use:

      if (is.character(segDat)) {

I've fixed two cases like this in the code, cf. commit a30699d.

Annoying 'Note' displayed by function 'kronecker' about sparse matrices

> example(c3co)
Note: method with signature ‘dsparseMatrix#dsparseMatrix’ chosen for function ‘kronecker’,
target signature ‘dgCMatrix#dgTMatrix’.
"sparseMatrix#TsparseMatrix" would also be valid

S4: What was the reason/rationale for using S4?

I probably asked before but I don't remember the answer; what's the reason for using the S4 class system (rather than S3)? It don't see anything that is making use of S4-unique features.

ROBUSTNESS: Partial element names

Moved to mpierrejean/jointseg#7

fitC3co(): Use a good old for loop instead of while loop?

In fitC3co() there's the following while loop:

    it <- 1
    pp <- nb.arch[it]
    cond <- FALSE  ## condition for (early) stopping
    fitList <- allRes <- allLoss <- list()
    bestConfigp <- allConfig <- NULL
    while (!cond) {
        if (verbose) mprintf("Number of latent features: %d\n", pp)

        ## Initialization
        bestRes <- NULL
        bestBIC <- -Inf
        bestConfig <- aConf <- NULL
        ## Z0 are initialized with the centered version of the data
        Z0 <- initializeZ(Yc$Y1, Yc$Y2, p=pp, ...)
        if (verbose) {
            mprintf("Parameter configuration: (%s)\n",
                    comma(colnames(configs)))
        }
        allRes[[pp]] <- list()
        allLoss[[pp]] <- list()
        for (cc in seq_len(nrow(configs))) {
            cfg <- configs[cc, , drop = TRUE]
            if (verbose) {
              mprintf(" - configuration #%d (%s) of %d: ",
                  cc, 
                  comma(sprintf("%s = %g", names(cfg), cfg)),
                  nrow(configs))
            }

              
            ## THIS is the function that costs comme computation 
            if (verbose) mprintf("fused lasso")
            if (is.null(Y2)) cfg <- cfg["lambda1"]  ## Should this be an error?
            res <- positiveFusedLasso(Y = Y, Z = Z0, lambda = cfg)
            
            if (verbose) mprintf(", BIC = ")
            stats <- modelFitStatistics(Reduce(`+`, Y), res@E$Y, res@W, res@S$Z)
            BIC <- stats[["BIC"]]
            if (verbose) mprintf("%g", BIC)
            aConf <- c(pp, cfg, stats[["PVE"]], BIC, stats[["logLik"]], stats[["loss"]])
            ## replace above line by: aConf <- stats
            allConfig <- rbind(allConfig, aConf)
            allRes[[pp]][[cc]] <- res
            allLoss[[pp]][[cc]] <- stats[["loss"]]
            
            if (BIC > bestBIC) { ## BIC has improved: update best model
                bestRes <- res
                bestBIC <- BIC
                bestConfig <- aConf
                if (verbose) mprintf("*")
            }
            if (verbose) mprintf("\n")
        }

        fitList[[it]] <- bestRes
        bestConfigp <- rbind(bestConfigp, bestConfig)
        ## sanity check: minor CN < major CN in the best parameter
        ## configurations (not for all configs by default)
        if (!is.null(Y2) && warn) {
            Z <- slot(bestRes, "S")
            dZ <- Z$Z2 - Z$Z1
            tol <- 1e-2  ## arbitrary tolerance...
            if (min(dZ) < -tol) {
                warning("For model with ", pp, " features, some components in minor latent profiles are larger than matched components in major latent profiles")
            }
        }
        
        it <- it + 1
        pp <- nb.arch[it]
        ## stop if pp has reached the max of its grid
        cond <- is.na(pp)
    }

The only way I can see this while loop working is that it loops of all values of nb.arch; there is no early stopping or similar. Is this a remnant from old days? Something like:

for (it in seq_along(nb.arch)) {
  pp <- nb.arch[it]
  ...
}

BUG FIX(?): Was it your intention to export the S4 classes?

In commit 9395db, I just fixed:

diff --git a/NAMESPACE b/NAMESPACE
index ba237a4..bbdf326 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -15,6 +15,8 @@ export(mixSubclones)
 export(positiveFusedLasso)
 export(pvePlot)
 export(segmentData)
+exportClasses(c3coFit)
+exportClasses(posFused)
 exportMethods(Wplot)
 exportMethods(createZdf)
 exportMethods(showC3coFit)
diff --git a/R/AllClasses.R b/R/AllClasses.R
index 7241ce0..5a71bcd 100644
--- a/R/AllClasses.R
+++ b/R/AllClasses.R
@@ -22,8 +22,8 @@
 #'
 #' @exportClass posFused 
 setClass(
-  "posFused",
-  representation(
+  Class = "posFused",
+  representation = representation(
     S = "list",
     S0 = "list",
     W = "matrix",
@@ -43,11 +43,10 @@ setClass(
 #' 
 #' @slot fit A List of [\code{\linkS4class{posFused}}] objects.
 #' 
-#' @exportClass c3coClass
-#'
+#' @exportClass c3coFit
 setClass(
-  "c3coFit",
-  representation(
+  Class = "c3coFit",
+  representation = representation(
     bkp = "list",
     segDat = "list",
     fit = "list"
diff --git a/R/AllGenerics.R b/R/AllGenerics.R
index c2baf0e..c234b4f 100644
--- a/R/AllGenerics.R
+++ b/R/AllGenerics.R
@@ -83,7 +83,7 @@ setMethod(
         rownames(W) <- rownamesW

Progress/vebose/debug messages: always use message() - never cat() or print()

$ grep -E "(cat|print)[(]" R/c3co.R
            print("Reading segmentation results from file: ")
            print(segDat)
        print("Segmented data is provided, skip segment step")

Input data format

The current input format for the c3co function involves a list of data.frames with 4 columns (total CN, DH, position, chromosome), and an argument 'stat' to explicitly specify whether to work on total CN only or jointly on total CN, DH :

function (dat, stat = c("C1C2", "TCN"), ...)

This is redundant (position and chromosome should be the same for all elements of the list)
It's not obious to apply the function when DH is not available (one has to add a dummy 'DH' column to all the elements in the list
(related) the position information is currently not used

It seems to me that a more efficient and user-friendly signature for this function would be:

function (CN, DH=NULL, chromosome=NULL, position=NULL, ...)

where :

'CN' is a numeric matrix
'DH' is an optional numeric matrix (if not provided, c3co is run on CN only)
'chromosome' is an optional vector (if not provided, we assume one chromosome)
'position' is an optional vector (if not provided, we assume that the input data are already sorted)

Broken vignette

The vignette gives an error when trying to install the package in R 3.3.2 on Linux:

> devtools::install(build_vignettes=TRUE)
Installing InCaSCN
Using GitHub PAT from envvar GITHUB_PAT
'/usr/lib/R/bin/R' CMD build '/home/hb/repositories/others/incas-cn' --no-resave-data --no-manual
* checking for file ‘/home/hb/repositories/others/incas-cn/DESCRIPTION’ ... OK
* preparing ‘InCaSCN’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Warning in engine$weave(file, quiet = quiet, encoding = enc) :
  Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1.
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5

Loading required package: jointseg
Loading required package: acnr
Loading required package: R.utils
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.21.0 (2016-10-30) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'

The following objects are masked from 'package:methods':

    getClasses, getMethods

The following objects are masked from 'package:base':

    attach, detach, gc, load, save

R.utils v2.5.0-9000 (2016-11-07) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'

The following object is masked from 'package:utils':

    timestamp

The following objects are masked from 'package:base':

    cat, commandArgs, getOption, inherits, isOpen, parse, warnings

Loading required package: xtable
Loading required package: matrixStats
matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.
Loading required package: parallel
Quitting from lines 52-55 (vignette.Rmd) 
Error: processing vignette 'vignette.Rmd' failed with diagnostics:
'tumorFraction' should be in c(0.14, 0.34, 0.5, 0.79, 1) for dataSet GSE11976
Execution halted
Error: Command failed:
 '/usr/lib/R/bin/R' CMD build '/home/hb/repositories/others/incas-cn' --no-resave-data --no-manual 
 
In addition: Warning message:
In readLines(file) :
  incomplete final line found on '/home/hb/repositories/others/incas-cn/DESCRIPTION'
>

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] httr_1.2.1           backports_1.0.4      R6_2.2.0            
 [4] rprojroot_1.1        tools_3.3.2          withr_1.0.2         
 [7] curl_2.2             crayon_1.3.2         memoise_1.0.0       
[10] desc_1.0.1           knitr_1.15           callr_1.0.0.9000    
[13] git2r_0.15.0         pkgload_0.0.0.9000   digest_0.6.10       
[16] pkgbuild_0.0.0.9000  devtools_1.12.0.9000

Tumor content > 100

The function getWeightMatrix sometimes returns proportions larger than 100!

I haven't digged in the code to see why this happens, but obviously it should not happen Reproducible example:

> library("c3co")
> set.seed(48)
> M <- getWeightMatrix(100, 0, 3, 15, sparse.coeff=0.7, contam.coeff=0.6, contam.max=2);
> rowSums(M)
 [1]  90 100 100  95 345 100  70 100  85 100  90 100  90  95  95

TASK: Turn c3co() into generic function

Instead of having to specify either argument dat or segDat for c3co(), e.g.

fit <- c3co(dat = NULL, parameters.grid, segDat = segDat)

by turning c3co() into a generic function we can write c3co() S3 methods methods for dat and segDat classes (Issue #30) so it can be called as:

fit1 <- c3co(dat, parameters.grid)
fit2 <- c3co(segDat, parameters.grid)

and S3 methods dispatching with make sure the proper method is called.

BUG: cat(str(x))

FYI, using cat(str(...)) is most likely a bug, e.g. from https://github.com/pneuvial/c3co/blob/master/R/AllGenerics.R#L22-L26:

        cat("Subclones\n")
        cat(str(this@S), "\n")
        cat("Weights\n")
        cat(str(this@W), "\n")
        cat("Estimates\n")
        cat(str(this@E), "\n")
        [...]
        cat("param\n")
        cat(str(this@param), "\n")

Add test scripts

unit test for each function using simulated data

Plot functions

Write a generic function to plot for latent profiles TCN, C1 and C2.
Write a generic function to plot weights: this function can also perform the clustering on patients.

TASK: Preserve sample and chromosome names

We should try to preserve sample and chromosome names throughout the pipeline, if they exist in the input data.

I've just added the C3coSegmentation class for objects returned by segmentData(), e.g.

> print(seg)
C3coSegmentation: 
 Method: jointseg::jointSeg
 Number samples: 3
 Number chromosomes: 24
 Number segments: 400
 Dimensions: [3] 'Y1', 'Y2', 'Y'

Ideally we could also report (and subset on) sample and chromosome names, e.g.

> print(seg)
C3coSegmentation: 
 Method: jointseg::jointSeg
 Samples: [3] 'SampleA', 'SampleB', 'SampleC'
 Chromosomes: [24] '1', '2', ..., 'Y'
 Number segments: 400
 Dimensions: [3] 'Y1', 'Y2', 'Y'

This will also be handy when plotting c3co results.

PARALLELIZATION: fitC3co() is a good candicate for parallelization

The grid search in fitC3co():

    for (it in seq_along(nb.arch)) {
        pp <- nb.arch[it]
        if (verbose) mprintf(" - Iteration #%d (%d latent features) of %d ...\n", it, pp, length(nb.arch))
        [...]
        for (cc in seq_len(nrow(configs))) {
            cfg <- configs[cc, , drop = TRUE]
            [...]
            res <- positiveFusedLasso(Y = Y, Z = Z0, lambda = cfg)
            [...]
        } ## for (cc ...)  
    } ## for (it ...)

should be a good candidate for parallelization.

From a quick looks it seems to just go through all combinations of (nb.arch, configs) in a grid search manner where each point is optimized independently of the others - and there are no early-stopping conditions taking place. If so, it should be possible to restructure the code such that each point can run in parallel and only at the end we'll identify what the "best" config is for each nb.arch. This means that we can use future:s, so that the user can choose to parallelize the grid points on the local machine, as individual jobs on a HPC scheduler, etc.

TASK: Add tests of `get.W()` and `get.Z()` where results have been validated analytically

Add tests of get.W() and get.Z() where results have been validated analytically

STATUS: Should I be using the develop branch?

Hi, coming back to this again and finally might give it a real go. Should I be using the develop branch?

NOTATION: Rename slot 'S' and 'S0' of 'posFused' to 'Z' and 'Z0'

setClass(
    Class = "posFused",
    representation = representation(
        Y  = "list",        ## original signal
        S  = "list",        ## subclones (aka Z)
        S0 = "list",        ## initial estimates of subclones (aka Z0)
        W  = "matrix",      ## weights
        mu = "numeric",     ## intercept
        E  = "list",        ## signal reconstruction (aka Yhat)
        params  = "numeric",
        failure = "logical"
    )
)

Merge `intercept` into `develop`?

Hi, I've done quite a few cleanups etc to the intercept branch. Does it make sense to simple merge it all back to the develop branch?

CLEANUP: Move all inst/extdata/ (15 MB) to separate package

Move all inst/extdata/ (15 MB) to separate package, e.g. InCaSCN.exdata. Advantages:

This example data package is likely to change much less often.
The InCaSCN package will be much smaller, which makes uploads via e.g. rhub::check() much quicker.
It's enough for InCaSCN to Suggests: InCaSCN.exdata.

ROBUSTNESS: upgrade positiveFusedLasso() warning to an error?

Should the warning "Under-identified problem: more latent features than samples" in positiveFusedLasso() be an error instead?

positiveFusedLasso <- function(Y, Z, lambda, eps=1e-1,
                               max.iter=50, warn=FALSE, verbose=FALSE) {

  ## problem dimensions
  M <- length(Y)
  p <- ncol(Z[[1]])  ## number of subclones/archetypes/latent features

  stopifnot(is.numeric(lambda), length(lambda) == M, length(Z) == M)
  if (p > nrow(Y[[1]])) {
    warning("Under-identified problem: more latent features than samples")
  }

Avoid saving segmented data to disk

Now that the c3co function can take segmented data as input, we should modify the PSCBS and facets wrappers so that they don't save segmented data to file but rather return it (it is light enough by definition). The examples and the PSCBS vignette should be updated accordingly.

Alternative segmentations

Currently the segmentation method used in the package is hard-coded, cf:

resSeg <- jointseg::jointSeg(Y=dataToSeg[ww,],K=100, modelSelectionMethod="Birge")

https://github.com/pneuvial/c3co/blob/master/R/segmentData.R#L56

It would be great to allow other segmentation methods/parameters to be used.
As a first step toward this, one could start by writing a getSegmentData function that takes as input the original data and a set of breakpoint positions, and returns the corresponding segment-level data.

FEEDBACK: getWeightMatrix()

Feedback / questions about:

getWeightMatrix(prop.max, prob.min, nb.arch, nb.samp, sparse.coeff = 0.5, contam.coeff = 0.6, contam.max = 30)

#' Generate weight matrix
#'
#' @param prop.max An integer that is the maximim proportion for a present archetype
#' @param prob.min An integer that is the minimum proportion  for a present archetype
#' @param nb.arch  An interger that is the number of archetypes (number of columns)
#' @param nb.samp An interger that is the number of samples (number of rows)
#' @param contam.max An interger between 0 and 100 that controls the maximal level of contamination by normal cells
#' @param contam.coeff A numeric between 0 and 1 that control the contamination level
#' @param sparse.coeff A numeric between 0 and 1 that control the sparsity by rows
#' @return A matrix of weights (the sum of rows is equal to 1)

TYPO: prop.max vs prob.min
What's the range of the above two arguments? (0,1,2,...,100)?
Why integers and not fraction in [0,1]? Same question for contam.max.
Several spelling errors.
A bit random spacing.

Use futures

Make use of "futures" :

to parallelize segmentation of different chromosomes in 'getSegmentData'
~~to parallelize estimation of Z1 and Z2 in 'positiveFusedLasso'~~

SPEEDUP: get.W() performance tweaks

Save for later, i.e. after test functions are in place:

diff --git a/R/get.W.R b/R/get.W.R
index 49570ce..08e5bd1 100644
--- a/R/get.W.R
+++ b/R/get.W.R
@@ -25,14 +25,19 @@
 #' @importFrom limSolve lsei
 get.W <- function(Z, Y) {
   J <- ncol(Z)
+  
+  E <- matrix(rep(1, times = J), nrow = 1L)
+  H <- rep(0, times = J)
+  G <- diag(J)
+  
   t(apply(Y, MARGIN = 1L, FUN = function(y) {
     lsei(
       A = Z,
       B = y,
-      E = matrix(rep(1, times = J), nrow = 1L),
+      E = E,
       F = 1,
-      H = rep(0, times = J),
-      G = diag(J),
+      H = H,
+      G = G,
       type = 2L)$X
   }))
 }

Installation instructions so that vignettes are built

devtools::install_github("pneuvial/c3co", build_vignettes =TRUE)

Don't understand why that's not the default :/

ROBUSTNESS: Assert finite estimates on W and Z

Add sanity checks on the estimates of W and Z:

W: all(is.finite(W)), all(W >= 0 & W <= 1)
Z: all(is.finite(Z)) - or does this depend on input Y?

NOTE: FACETS example broken (by the 'facets' package?)

The data file we have been using for facetsWrapper (now?) only contains a single sample. Therefore it is not possible (currently, and not useful, anlytime) to fit the model on this data.

Negative PVE

Sometimes algorithm produces negative PVE. Try to understand why? May be a problem of convergence.
This is an example which produces a negative PVE.

dataAnnotTP <- acnr::loadCnRegionData(dataSet="GSE11976", tumorFrac=1)
dataAnnotN <- acnr::loadCnRegionData(dataSet="GSE11976", tumorFrac=0)
len <- 500*10
nbClones <- 3
set.seed(88)
bkps <- list(c(100, 250)*10, c(150, 400)*10, c(150, 400)*10)
regions <- list(c("(1,2)", "(0,2)", "(1,2)"),
c("(0,3)", "(0,1)", "(1,1)"), c("(0,2)", "(0,1)", "(1,1)"))
datSubClone <- buildSubclones(len, dataAnnotTP, dataAnnotN,
                               nbClones, bkps, regions)
M <- rSparseWeightMatrix(15, 3, 0.7)
dat <- mixSubclones(subClones=datSubClone, M)
l2=0.1
parameters.grid.2 <- list(lambda=l2, nb.arch=2:6)
resC <- c3co(dat, stat="C1C2", parameters.grid.2)
resC@config$best

Error in cutree(hc, k = p) : elements of 'k' must be between 1 and 2

Using c3co@intercept with in-house data (cannot share here; only offline), I'm getting:

> fit <- c3co(dat = NULL, parameters.grid, segDat = segDat, verbose = TRUE)
Segmented data is provided, skip segment step
fitC3co() ...
Only one regularization parameter is provided. Using the same value for lambda[1] and lambda[2] : 
 num [1:7] 1e-04 2e-04 3e-04 4e-04 5e-04 6e-04 7e-04
 - Iteration #1 (2 latent features) of 4 ...
   + Parameter configuration: (lambda1, lambda2)
   + configuration #1 (lambda1 = 0.0001, lambda2 = 0.0001) of 7: fused lasso, BIC = -2376*
   + configuration #2 (lambda1 = 0.0002, lambda2 = 0.0002) of 7: fused lasso, BIC = -1356.38*
   + configuration #3 (lambda1 = 0.0003, lambda2 = 0.0003) of 7: fused lasso, BIC = -1437.48
   + configuration #4 (lambda1 = 0.0004, lambda2 = 0.0004) of 7: fused lasso, BIC = -1408.56
   + configuration #5 (lambda1 = 0.0005, lambda2 = 0.0005) of 7: fused lasso, BIC = -1408.51
   + configuration #6 (lambda1 = 0.0006, lambda2 = 0.0006) of 7: fused lasso, BIC = -1414.42
   + configuration #7 (lambda1 = 0.0007, lambda2 = 0.0007) of 7: fused lasso, BIC = -1401.05
 - Iteration #1 (2 latent features) of 4 ... DONE
 - Iteration #2 (3 latent features) of 4 ...
Error in cutree(hc, k = p) : elements of 'k' must be between 1 and 2

Enter a frame number, or 0 to exit   

 1: source("main.R", echo = TRUE)
 2: withVisible(eval(ei, envir))
 3: eval(ei, envir)
 4: eval(ei, envir)
 5: main.R#17: c3co(dat = NULL, parameters.grid, segDat = segDat, verbose = TRUE)
 6: fitC3co(Y1, Y2 = Y2, parameters.grid = parameters.grid, ..., verbose = verbos
 7: initializeZ(Yc$Y1, Yc$Y2, p = pp, ...)
 8: t(initZ(Y1, p))
 9: initZ(Y1, p)
10: cutree(hc, k = p)
11: stop(gettextf("elements of 'k' must be between 1 and %d", n), domain = NA)
12: (function () 
{
    utils::recover()
})()

With c3co@develop I get:

> fit <- c3co(dat = NULL, parameters.grid, segDat = segDat, verbose = TRUE)
[1] "Segmented data is provided, skip segment step"
Only one regularization parameter is provided. Using the same value for lambda[1] and lambda[2] : 
 num [1:7] 1e-04 2e-04 3e-04 4e-04 5e-04 6e-04 7e-04
Number of latent features: 2
Parameter configuration: (lambda1; lambda2)
1e-04; 1e-04
2e-04; 2e-04
3e-04; 3e-04
4e-04; 4e-04
5e-04; 5e-04
6e-04; 6e-04
7e-04; 7e-04
Warning in fitC3co(Y1, Y2 = Y2, parameters.grid = parameters.grid, ...,  :
  For model with 2 features, some components in minor latent profiles are larger than matched components in major latent profiles
Number of latent features: 3
Error: p <= n is not TRUE

Enter a frame number, or 0 to exit   

 1: source("main.R", echo = TRUE)
 2: withVisible(eval(ei, envir))
 3: eval(ei, envir)
 4: eval(ei, envir)
 5: main.R#17: c3co(dat = NULL, parameters.grid, segDat = segDat, verbose = TRUE)
 6: fitC3co(Y1, Y2 = Y2, parameters.grid = parameters.grid, ..., verbose = verbose)
 7: initializeZ(Y1c, Y2 = Y2c, p = pp, ...)
 8: stopifnot(p <= n)
 9: stop(msg, call. = FALSE, domain = NA)

Any ideas?

Facets -> FACETS

It's FACETS with all upper case. I've updated the help, but c3co::Facetswrapper() should probably also be renamed accordingly.

TASK: Unit tests for `get.W()`, `get.Z()`, and `positiveFusedLasso()`

Write unit tests for get.W(), get.Z(), and positiveFusedLasso():

PERFORMANCE: Use rowAnyNAs(x) instead of (rowSums(is.na(x)) > 0)

The following rowSums(is.na(x)) > 0 statements can be replaced by rowAnyNAs(x) of matrixStats (>= 0.52.0):

$ grep -F "rowSums(" R/segmentData.R 
            idxNAC1 <- which(rowSums(is.na(binDatTCN)) > 0)
            idxNAC2 <- which(rowSums(is.na(binDatDH)) > 0)
            idxNA <- which(rowSums(is.na(binDatTCN)) > 0)

Z1 <= Z2 ?

In https://github.com/pneuvial/incas-cn/blob/master/R/getZ.R#L17-L23 we are taking care of the situation where minor CN > major CN in the latent profiles. Does this happen? Hopefully not. If not, then we should rather throw an error than silently swapping minor and major CNs.

TASK: Add assertions/sanity checks on input and output dimensions

Add stopifnot()-like assertions to validate consistency of input dimensions:

Add stopifnot()-like assertions to validate consistency of output dimensions:

Unused argument in 'loadPSCBSdata'

The argument 'pattern' in loadPSCBSdataand PSCBSwrapper is currently not used. Drop it?

Wrapper for PSCBS data

Write a wrapper so that PSCBS data files can be easily processed using the package.

HELP: help("mixSubclone") mentions: A matrix of weights in percentage [0,100]

mixSubclone() takes weights in [0,1], correct? That is, [0,100] is not correct?

POTENTIAL BUG: partial match of 'Z' to 'Z1'

Not sure if testthat does it automatically, but I have the following in my ~/.Rprofile:

options(warnPartialMatchArgs = TRUE, warnPartialMatchAttr = TRUE, warnPartialMatchDollar = TRUE)

and then I get lots of warnings:

> devtools::test()
Loading c3co
Loading required package: testthat

Attaching package: 'testthat'

The following object is masked from 'package:purrr':

    is_null

Warning: S3 method 'segmentData.list' was declared in NAMESPACE but not found
Testing c3co
Warning: `encoding` is deprecated; all files now assumed to be UTF-8
✔ | OK F W S | Context
✔ | 42       | Construction of subclones [0.6 s]
✔ |  0   10   | c3co test [5.1 s]
──────────────────────────────────────────────────────────────────────────────────
test_c3co.R:38: warning: c3co terminates on C1,C2
partial match of 'coefficient' to 'coefficients'

test_c3co.R:38: warning: c3co terminates on C1,C2
partial match of 'bkp' to 'bkpList'

test_c3co.R:38: warning: c3co terminates on C1,C2
partial match of 'bkp' to 'bkpList'

test_c3co.R:38: warning: c3co terminates on C1,C2
For model with 2 features, some components in minor latent profiles are larger than matched components in major latent profiles

test_c3co.R:38: warning: c3co terminates on C1,C2
For model with 4 features, some components in minor latent profiles are larger than matched components in major latent profiles

test_c3co.R:38: warning: c3co terminates on C1,C2
For model with 5 features, some components in minor latent profiles are larger than matched components in major latent profiles

test_c3co.R:38: warning: c3co terminates on C1,C2
For model with 6 features, some components in minor latent profiles are larger than matched components in major latent profiles

test_c3co.R:45: warning: c3co terminates on TCN
partial match of 'coefficient' to 'coefficients'

test_c3co.R:45: warning: c3co terminates on TCN
partial match of 'bkp' to 'bkpList'

test_c3co.R:45: warning: c3co terminates on TCN
partial match of 'bkp' to 'bkpList'
──────────────────────────────────────────────────────────────────────────────────
✔ | 20       | Toy data generation
✔ | 32   15   | Initialization of the latent features [95.9 s]
──────────────────────────────────────────────────────────────────────────────────
test_initializeZ.R:16: warning: (unknown)
partial match of 'coefficient' to 'coefficients'

test_initializeZ.R:16: warning: (unknown)
partial match of 'bkp' to 'bkpList'

test_initializeZ.R:16: warning: (unknown)
partial match of 'bkp' to 'bkpList'

test_initializeZ.R:36: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:38: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:39: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:36: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:38: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:39: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:36: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:38: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:39: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:36: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:38: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

test_initializeZ.R:39: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'
──────────────────────────────────────────────────────────────────────────────────
⠋ |  0 1     | c3co test on PSCBS data[1] "Segmented data is provided, skip segment step"
✖ |  0 1   1 | c3co test on PSCBS data [55.7 s]
──────────────────────────────────────────────────────────────────────────────────
test_PSCBS.R:7: error: Produce (C1,C2) segmentation from PSCBS locus-level data
argument is not interpretable as logical
1: PSCBSwrapper(PSCBSdata, stat = "C1C2") at /home/hb/repositories/c3co/tests/testthat/test_PSCBS.R:7

test_PSCBS.R:16: skip: c3co terminates on PSCBS locus-level data
Empty test
──────────────────────────────────────────────────────────────────────────────────
✔ | 40       | Construction of weight matrices
⠹ | 40   3   | Construction of weight matrices
══ Results ═══════════════════════════════════════════════════════════════════════
Duration: 158.6 s

OK:       134
Failed:   1
Warnings: 28
Skipped:  1

I've already reported on some of these:

but the:

test_initializeZ.R:39: warning: output of initializeZ has correct dimensions and contents
partial match of 'Z' to 'Z1'

ones looks potentially concerning.

Avoid relying on 'minMaxPos'

Currently 'createZdf' requires an argument 'minMaxPos', a matrix containing min and max position for each chromosome. This should really be taken care of by the package, not by the user.

If we add to the slot 'bkp' of a fitC3co object a "first breakpoint" at the first position and a "last breakpoint" at the last position, then the object will be selfcontained.

TASK: Rename variable `Z` to `Zt` where it is actually `t(Z)`

TASK: Rename Z fields Zt fields, e.g. `Z1` -> `Z1t`

TASK: Rename Z fields Zt fields, e.g. Z1 -> Z1t

NOTATION: Use same notation for dimension everywhere

Use same notation for dimension everywhere:

n: Number of samples
K: Number of subclones
J: Number of segments
M: Number of signal dimensions; M = {1,2}

Commonly used in:

Y: a n x J matrix
W: a n x K matrix
Z: a K x J matrix
Zt = t(Z): a J x K matrix

For a starter, there are several locations where p and L are used.

BUG: 'length(x) = 2 > 1' in coercion to 'logical(1)

in R-devel there are now run-time checks for non-scalar conditions, e.g. if (1:2 == 1). I've enabled those checks on Travis and sure enough we get (https://travis-ci.org/pneuvial/c3co/jobs/495017840):

Error: processing vignette 'c3co-sim-1d.Rmd' failed with diagnostics:
'length(x) = 2 > 1' in coercion to 'logical(1)'

when building the vignettes.

HELP: Some of the help pages need some more love

Several of the help pages need some more love, e.g. more details on input data types and return data types. That type of information is really helpful for new comers when it comes to troubleshooting and understand what's been passed around.