pbreheny / biglasso Goto Github PK

View Code? Open in Web Editor NEW

113.0 14.0 29.0 2.66 MB

biglasso: Extending Lasso Model Fitting to Big Data in R

Home Page: http://pbreheny.github.io/biglasso/

R 32.48% C 2.88% C++ 64.59% Shell 0.05%

lasso r out-of-core bigdata parallel-computing

biglasso's People

Contributors

Stargazers

Watchers

biglasso's Issues

are NCV penalties working correctly?

get latest version

devtools::install_github("pbreheny/biglasso")
#> Skipping install of 'biglasso' from a github remote, the SHA1 (d0d0ec2) has not changed since last install.
#> Use force = TRUE to force installation
library(biglasso)
#> Loading required package: bigmemory
#> Loading required package: Matrix
#> Loading required package: ncvreg
library(ncvreg)
library(glmnet)
#> Loaded glmnet 4.1-8
library(reprex)

data(colon)
X <- colon$X |> ncvreg::std()
X <- cbind(1, X)
xtx <- apply(X, 2, crossprod)
init <- rep(0, ncol(X)) # cold starts - use more iterations (default is 1000)
y <- colon$y
og_resid <- resid <- drop(y - X %*% init)
og_X <- X.bm <- as.big.matrix(X)

fit1b <- biglasso_fit(X.bm, y, lambda = 0.05, xtx=xtx, r = resid,
penalty = "MCP", max.iter = 10000)

fit2b <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
penalty = "MCP", max.iter = 10000)

these run fine...

tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.1)
#> ----- PASSED : <-->
#> call| tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.1)
tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.1)
#> ----- PASSED : <-->
#> call| tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.1)

...but the tests below fail -- does this signal a problem?

tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#> call| tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.01)
#> diff| Mean relative difference: 0.2640539
tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#> call| tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.01)
#> diff| Mean relative difference: 0.2418661
Created on 2024-04-20 with reprex v2.1.0

Default CV error contradictory documentation

This states that the cross-validation error for classification is the binomial deviance:
https://github.com/YaohuiZeng/biglasso/blob/4e98d83e5cd0dc6ddb9c417faa1673a10e19eca5/R/cv.biglasso.R#L7-L9
but the eval.metric parameter' documentation says it defaults to misclassification:
https://github.com/YaohuiZeng/biglasso/blob/4e98d83e5cd0dc6ddb9c417faa1673a10e19eca5/R/cv.biglasso.R#L22-L24

Do these two passages refer to different things? To me it sounds like they are contradicting.

Rigorous LASSO and causal inference

Hi.

I have two requests for future development:

I want to estimate causal LASSO on large datasets. Currently I do it with the hdm package. I think it internally uses glmnet, thus it quickly becomes infeasible as I increase the size of the data set. Could you incorporate a standardized procedure to estimate this?
In order to quickly manually implement a causal LASSO one can run a rigorous LASSO of X separately on Y and D, and then use the partialled-out results to estimate an average treatment effect. There is currently no option to estimate a rigorous LASSO in the package, could it be implemented?

Gal

user-defined eval.metric function

Fantastic package. Any chance you might allow the user to provide a custom function to cv.biglasso for eval.metric? You probably don't want to get into the weeds in terms of writing a new function to accommodate everyone's favorite loss function (e.g. AUC, PPV, IDI, etc) -- but only having misclassification error in the binary setting can be limiting.

I can't reproduce your issue posted on SO.

Concerning your question on SO, I can't reproduce your error because I don't know exactly what you've changed compared to the current master branch.
Could you commit the changes you've tried?
It would be easier to try to fix the problem.

Moreover, you should consider using devtools and roxygen2 for package development, it's really nice and elegant. Following the instruction in the book R packages makes it easy.

Code and data used to produce benchmarks?

Was just wondering if it would be possible to release the code you used to produce the given benchmarks? Just asking because this would help to compare with other methods, and have everything 100% reproducible... And providing direct access to the R formatted real-world datasets would also be handy, just to be sure that other people would be using the exact same data... Would this be possible?

Proper way to incorporate penalty.factor

Two questions/remarks:

I would have expected the same results (with rescaled lambda sequences)

data(colon); X.bm <- as.big.matrix(colon$X); y <- colon$y
fit.lasso  <- biglasso(X.bm, y, family = 'gaussian')
fit.lasso2 <- biglasso(X.bm, y, family = 'gaussian',
                       penalty.factor = rep(100, ncol(X.bm)))

Looking at the code of {glmnet}, it seems that they rescale the multiplicative factors (by dividing by their mean). Should {biglasso} do the same here?

What is the (implementation) problem with having some penalty factor as 0? (you don't allow unpenalized variables in the current version)

Removed from CRAN

biglasso got removed from CRAN today due to issues with function signatures differing in some files:
https://www.stats.ox.ac.uk/pub/bdr/LTO/biglasso.out

Bug in cv.biglasso() triggered by edge case

I ran into a minor issue when doing some testing (for which i use a very short lambda sequence): if ind if only of length 1, E and Y loose a dimension, which in turn causes apply() later to trip. I believe adding drop = FALSE would make the situation more robust.

ncore problem

Hi,

I am having some problem with the ncores option in biglasso() and cv.biglasso() function. I compiled biglasso on my mac enabling openMP (4 cores). However, when I am trying to reproduce the worked out example of the cv.biglasso() function usng ncores=4, I am getting an error

Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: could not find function "attach.big.matrix"
In addition: Warning messages:
1: In .Internal(gc(verbose, reset)) :
  closing unused connection 6 (<-localhost:11359)
2: In .Internal(gc(verbose, reset)) :
  closing unused connection 5 (<-localhost:11359)
3: In .Internal(gc(verbose, reset)) :
  closing unused connection 4 (<-localhost:11359)
4: In .Internal(gc(verbose, reset)) :
  closing unused connection 3 (<-localhost:11359)

I got the same error in our cluster where biglasso is compiled using intel MKL. Can anyone help?

Thanks,
Subhomoy

Automatically detect and cast variables to expected type(s)

I have bumped into issues a few times where the expected formats for the inputs to biglasso turn out to be invalid, so I'd like to propose a set of usability enhancements that many could benefit from.

Matrix format. Suggestion: Detect normal matrix and cast to numeric big.matrix internally. (Example: I'm used to using normal matrices, and integer matrices. The former causes an error that is not clear, and the latter is a recipe for a crash! #22 )
If matrix is not big.matrix already, use the non-disk-spinning cast automatically or let user pick ( kaneplusplus/bigmemory#97 )
Response format. Cast to numeric. (Example: I'm used to using factors for classification, but biglasso performs numeric operations on them even when doing classification. This leads to strange errors and does not let cv.biglasso complete correctly: #25 ).

If changing the current behavior is not a good idea, it may still be nice to report an error about types so the user can make the necessary adjustments themselves.

Thanks for considering!

Immediate crash on small problem

R MRO v3.5.3 and RStudio 1.2.1335 (both fresh installs), latest biglasso. Windows 10 (1903).

Biglasso works on the "colon" toy example with exact same script as below (except X and y are pulled out of the colon data object).

But, biglasso fails with a complete R crash (RStudio reports fatal error) with this:
minimal.gz
Contains a tiny matrix and outcome vector, X and y.

library(biglasso)
load('F:/minimal.gz')
X.bm <- as.big.matrix(X)
fit.bin.lasso <- biglasso(X.bm, y, family = "binomial")

Yet it works perfectly with glmnet:
fit.bin.lasso.glmnet <- glmnet(X, y, family = "binomial")

> str(X)
 int [1:242, 1:2000] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL
> str(y)
 num [1:242] 1 1 1 1 1 1 1 1 1 1 ...

Disabling intercept

Hi,

Thank you for maintaining and creating biglasso, it's been really helpful! We've run into an application where we need to disable the intercept, can you please provide some guidance on how to do this? (a hacky way to do this would be great even if its not standardly supported!)

.

standardize = FALSE ?

Is there a way for the biglasso() function to not standardize the data? What I'm hoping for is the equivalent of standardize =FALSE in glmnet. I'm using penalty.factor to specify varying penalties on groups of coefficients and wan't more flexibility to effectively standardize groups of coefficients by grouping via the penalty.factor argument instead of individually. However, to do so I need to disable the default of standardization.

Support methods to calculate AIC and BIC values of biglasso fit

I was wondering if it would be possible to also support the calculation of AIC and BIC values of a biglasso fit, similar to the way that the ncvreg package provides this (returning a vector of AIC and BIC values for each lambda value used). This would enable selection of the best lambda based on AIC or BIC, which is faster than based on cross validation.

does biglasso suport cox?

Hi,
I have seen in the vignette of biglasso that cox family isn't an option allowed for biglasso. Do you know a way to perform biglasso with cox model? If no, I'd be interested in making a collaboration possibly to develop a Cox biglasso model. I have read also that sparse matrix aren't available for Cox family in the glmnet package. Would using another techniques to allow Cox family conflict with the basic development of biglasso?
Thank you,

I have seen that #7 is very related to this as well.

Sparse matrices

Great work. Is it possible to extend package to allow sparse matrices as input?

Error when setting row.idx

When I use row.idx I get the error
Error in [<-(*tmp*, cv.ind == i, 1:res$nl, value = res$loss) : (subscript) logical subscript too long

Maybe this error is related to line 87 being commented?
y <- fit$y # this would cause error if eval.metric == "MAPE"

Code that causes error:
train <- sample(c(T,F), size, c(.5,.5), replace = TRUE)
fit$blasso <- cv.biglasso(X = x, y = y, row.idx = which(train), penalty = "lasso", family = "binomial", nfolds = 10)

If I take "row.idx = which(train)" out it runs without any errors.

Possible to set the intercept to 0

Hi.

Thanks for making biglasso, it works well with bigmemory and memory-mapped files.

Is there a way, or hack, to set the intercept to 0? In glmnet you can set intercept=FALSE and the intercept(s) are set to zero.

Will this package supports a weight vector?

Impressive package, would love to be able to see support for weights

Warning: stack imbalance

Using code as

set.seed(11)

mat <- matrix(sample(c(0, 1), 5000 * 500, replace = TRUE), nrow = 5000)
mat <- bigmemory::as.big.matrix(mat, type = "double", shared = TRUE)

res <- biglasso::cv.biglasso(
 mat, sample(c(TRUE, FALSE), 5000, replace = TRUE), family = "binomial",
 ncores = 4, nlambda = 50, verbose = TRUE
)

I get

Preprocessing start: 2021-09-20 21:29:25.000
Preprocessing end: 2021-09-20 21:29:25.000

-----------------------------------------------
Lambda 1. Now time: 2021-09-20 21:29:25.000
...
Lambda 49. Now time: 2021-09-20 21:29:27.000
Warning: stack imbalance in '<-', 2 then 21

Any my sessionInfo() is

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cluster/spack/apps/linux-centos7-x86_64/gcc-8.2.0/openblas-0.3.7-aapr6wps44sbmhbedmmigxjk5bhfy5p3/lib/libopenblas_haswell-r0.3.7.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 bigmemory_4.5.36    compiler_4.0.2     
[4] Matrix_1.2-18       biglasso_1.4.1      parallel_4.0.2     
[7] Rcpp_1.0.7          grid_4.0.2          ncvreg_3.13.0      
[10] lattice_0.20-41

Happy to share further info if requested!

Estimate run time

Hi,

Is there a way to estimate run time of biglasso()? I have a data set of ~341k individuals vs 640k SNPs. This took a days to run, resulting in coefficients for 45 lambdas. I have another data set that was started around the same time that consists of 341k individuals and 332k SNPs. It is on lambda 65 and still running.

So:

Is there a way to determine how many lambdas biglasso() will test? Particularly if I already know at what lambda produces the min RMSE and other parameters I put into the training (dfmax etc)?
If possible, a message when biglasso() starts processing that reports how many lambda it will test would be a useful addition.

Thanks again for such a useful R package!

Vince

Support for other families recognized by glmnet (poisson, multinomial, cox, mgaussian)

Great package! Are there any plans to extend it to also support the other families currently supported by glmnet, i.e. "poisson","multinomial","cox" and "mgaussian"? (I was currently working on a problem with big data that required the use of either multinomial or mgaussian)

Dealing with Extremely large datasets (750 GB)

Hi,

I have a very large dataset. The X matrix has 480k rows and around 205k cols. I have saved it as a filebacked.big.matrix and i am using attach.big.matrix to "load" at the time of performing the elastic net regression. Sadly, this is working really slow and took almost 2 days for 1 lambda (lambda is quite small). On this, i have 2 questions

Will it be faster if i load the data in memory? I can ask for upto 2-3TB on the server.
If yes, how can i load filebacked matrix in memory? I was trying using read.big.matrix, deepcopy etc but it doesnt look like it works
Any other suggestions on dealing with such a large dataset?

Thanks
Prateek Sasan

Segfault

I get the following error when using biglasso:

 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: col.idx + 1
 2: biglasso(X.bm, Y, screen = "SSR-BEDPP", ncores = 16, verbose = TRUE,     dfmax = 50000, output.time = TRUE, max.iter = 100)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault

It occurs when executing the following:

fit <- biglasso(X.bm, Y, screen = "SSR-BEDPP", ncores = 16, verbose=TRUE, dfmax=50000, output.time=TRUE, max.iter=100)

The data has 292074 features (SNPs) and 298769 individuals. biglasso runs for 92 lambdas. Excluding max.iter=100 makes the process run indefinitely, so we decided to set max.iter to 100 to see if we can get the training to complete (we assumed it ran indefinitely due be unable to converge on lambda 92 and thus run through all 1000 iterations, which we expect will take 5 days).

We have had no issues running biglasso on other similar datasets, albeit ones with many more individuals (~350,000) than features (~100,000).

I am using the stable version:

install.packages("biglasso")

Package versions:

R version 3.4.1 (2017-06-30) -- "Single Candle"
       Package     Version
            BH    1.66.0-1
    data.table      1.11.4
        ncvreg      3.10-0
          Rcpp     0.12.17
 RcppArmadillo   0.8.500.0
            BH    1.65.0-1
      biglasso       1.3-6
     bigmemory      4.5.33
 bigmemory.sri       0.1.3
    data.table    1.10.4-3
        ncvreg       3.9-1
          Rcpp     0.12.15
 RcppArmadillo 0.8.300.1.0

Thanks,

Vince

Lambda optimization with series of alpha values

Hi!

Currently, the script enables lambda parameter optimization for alpha = 1 (lasso) only. Is that correct? Is it possible to optimize the lambda.min value with crossvalidation using a range of alpha values (0, 0.05, 0.1, [...], 0.9, 0.95, 1.0)? This way I could compare lasso vs. ridge vs. elastic net approaches.

The code would look something like this
cv <- list()
a <- seq(0.0, 1.0, 0.05)
for (i in a) {
cv1 <- cv.biglasso(X, Y, family = "gaussian", nfolds = 5, alpha = i)
cv[i] <- cv1
}

data storage mode affects model fit -- why??

# TKP 
# April 2024 
# Objective: explore differences between model fits in-memory and filebacked
# remotes::install_github("YaohuiZeng/biglasso")
library(reprex)
library(biglasso)
#> Loading required package: bigmemory
#> Loading required package: Matrix
#> Loading required package: ncvreg
library(ncvreg)
# load colon data ----------------------------------------------------------------
data(colon)
X <- colon$X |> ncvreg::std()
# X <- cbind(1, X)
xtx <- apply(X, 2, crossprod)
init <- rep(0, ncol(X)) # cold starts - use more iterations (default is 1000)
y <- colon$y
resid <- drop(y - X %*% init)
X.bm <- as.big.matrix(X)

# take 1 -----------------------------
# file-backed model fit 
fit1 <- biglasso_fit(X = X.bm, y = y, lambda = 0.05, xtx = xtx, r = resid,
                     penalty = "lasso", max.iter = 10000)

# compare with `ncvreg::ncvfit()` -- runs in memory 
fit2 <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
               penalty = "lasso", max.iter = 10000)

# test coefficients 
tinytest::expect_equal(fit1$beta, fit2$beta, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#>  call| tinytest::expect_equal(fit1$beta, fit2$beta, tolerance = 0.01)
#>  diff| Mean absolute difference: 0.02752312

# test residuals
tinytest::expect_equal(fit1$resid, fit2$resid, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1$resid, fit2$resid, tolerance = 0.01)

str(fit1) # note: iter = 3773
#> List of 11
#>  $ beta          : Named num [1:2000] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "names")= chr [1:2000] "Hsa.3004" "Hsa.13491" "Hsa.13491.1" "Hsa.37254" ...
#>  $ iter          : int 3773
#>  $ resid         : num [1:62] 0.987 0.617 0.905 0.202 0.793 ...
#>  $ lambda        : num 0.05
#>  $ penalty       : chr "lasso"
#>  $ alpha         : num 1
#>  $ loss          : num 29.5
#>  $ penalty.factor: num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ n             : num 62
#>  $ y             : num [1:62] 1 0 1 0 1 0 1 0 1 0 ...
#>  $ time          : num 0.35
#>  - attr(*, "class")= chr [1:2] "biglasso" "ncvreg"
str(fit2) # note: iter = 2
#> List of 10
#>  $ beta          : Named num [1:2000] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "names")= chr [1:2000] "Hsa.3004" "Hsa.13491" "Hsa.13491.1" "Hsa.37254" ...
#>  $ iter          : int 2
#>  $ resid         : num [1:62] 0.987 0.617 0.905 0.202 0.793 ...
#>  $ lambda        : num 0.05
#>  $ penalty       : chr "lasso"
#>  $ gamma         : num 3
#>  $ alpha         : num 1
#>  $ loss          : num 29.5
#>  $ penalty.factor: num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ n             : int 62

nz1 <- which(fit1$beta != 0)
fit1$beta[nz1]
#>      Hsa.8147     Hsa.36689      Hsa.3152     Hsa.36665     Hsa.42949 
#> -0.0322178250 -0.1131494353  0.0218359134  0.0224074245 -0.0043473699 
#>     Hsa.692.2      Hsa.1272       Hsa.166     Hsa.31801      Hsa.3648 
#> -0.1202222748 -0.0109727114  0.0107157905  0.0414907591  0.0007487836 
#>     Hsa.13628      Hsa.3016      Hsa.5392   Hsa.35496.1      Hsa.1832 
#>  0.0022106709  0.0157166112  0.0360233297  0.0107837452 -0.0180740994 
#>     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159     Hsa.33268 
#> -0.0082794639 -0.0176171036  0.0036267952  0.0105440705 -0.0482435225 
#>      Hsa.6814      Hsa.1660      Hsa.1491   Hsa.41098.1 
#>  0.0302496905  0.0469252260  0.0107054818 -0.0234533426

nz2 <- which(fit2$beta != 0)
fit2$beta[nz2]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07
# note: there are fewer nonzero betas here compared to fit1

fit1$beta[intersect(nz1, nz2)]
#>     Hsa.3152    Hsa.36665    Hsa.42949    Hsa.692.2    Hsa.13628     Hsa.3016 
#>  0.021835913  0.022407425 -0.004347370 -0.120222275  0.002210671  0.015716611 
#>     Hsa.1832    Hsa.12241    Hsa.44244     Hsa.2928    Hsa.41159    Hsa.33268 
#> -0.018074099 -0.008279464 -0.017617104  0.003626795  0.010544071 -0.048243523 
#>     Hsa.6814     Hsa.1660 
#>  0.030249691  0.046925226
fit2$beta[intersect(nz1, nz2)]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07
# note: these estimates are much smaller than fit1! 


# take 2 -----------------------------
fit1_take2 <- biglasso_fit(X.bm, y, lambda = 0.05, xtx = xtx, r = resid,
                           penalty = "lasso", max.iter = 10000)

# compare with `ncvreg::ncvfit()`
fit2_take2 <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
                     penalty = "lasso", max.iter = 10000)

# test coefficients 
tinytest::expect_equal(fit1_take2$beta, fit2_take2$beta, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1_take2$beta, fit2_take2$beta, tolerance = 0.01)

# test residuals
tinytest::expect_equal(fit1_take2$resid, fit2_take2$resid, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1_take2$resid, fit2_take2$resid, tolerance = 0.01)


nz1_take2 <- which(fit1_take2$beta != 0)
fit1_take2$beta[nz1_take2]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07

nz2_take2 <- which(fit2_take2$beta != 0)
fit1_take2$beta[nz2_take2]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.722814e-06 -9.890970e-08 -1.908319e-07  9.282081e-07  5.391427e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07  1.868370e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.609362e-07  2.094525e-07  2.229240e-07

fit1_take2$beta[intersect(nz1_take2, nz2_take2)]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.722814e-06 -9.890970e-08 -1.908319e-07  9.282081e-07  5.391427e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07  1.868370e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.609362e-07  2.094525e-07  2.229240e-07
fit2_take2$beta[intersect(nz1_take2, nz2_take2)]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.638265e-06 -7.992673e-08 -1.577083e-07  9.061937e-07  4.678115e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.529658e-07 -5.153089e-07 -8.802163e-07  4.484431e-07  1.429543e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.589283e-07  1.462757e-07  1.735474e-07
# now all the estimated coefficients are just close to 0


# Questions: 
#   (1) why is it that in take 1, the test of residuals passes but the test of beta doesn't?
#   (2) why is it that in take 2, the beta coefficients are different? 
#       The data passed to the functions is the same. Nothing is stochastic here....
#   (3) why does running the same code produce different answers the second time I run it? (Related to issue 2)

^{Created on 2024-04-12 with reprex v2.1.0}

Warnings/errors on run

I have run biglasso on a similar dataset as reported previously (mostly sparse, other whole numbers cast as numeric). I made sure nothing was NA, infinite, or NaN and type was numeric (I remember the integer bug from before).

glm.trained = cv.biglasso(as.big.matrix(X),factor(y),family="binomial"ncores=16)
Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA
> plot(glm.trained)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
> glm.trained$cve
numeric(0)
> glm.trained$cvse
numeric(0)
> glm.trained$lambda
numeric(0)

Is this expected? I have never been able to get bigLasso to work on real problems for me like glmnet -- maybe my use cases are too unconventional?

> sum(is.na(X))
[1] 0
> sum(is.infinite(X))
[1] 0
> sum(is.nan(X))
[1] 0
str(X)
 num [1:72, 1:1000926] 0 35 0 0 59 0 456 0 0 0 ...
sum(rowMeans(X)==0)
[1] 0

Glmnet runs fine. Other fields within the output, glm.trained, are populated, such as $y (with 0's and 1's that line up reasonably well with the actual classes), center and scale, and the coefficient matrix looks like it has numbers in it (quite sparse, as expected).

Thanks for any insight. I can try to finagle a minimal reproducible example if the problem isn't obvious with my command and the error messages!

Cross validation tests failing

When running the test_biglasso_linear tests I'm getting errors in the linear cross validation tests.

Platform: Ubuntu 18.04
R version: 4.1.2
big lasso version: devtools::install_github("YaohuiZeng/biglasso") on 2/3/2022

$ R -f test_biglasso_linear.R

── Failure (???): Test cross validation:  ──────────────────────────────────────
as.numeric(cvfit.ncv$cvse) not equal to as.numeric(cvfit.hybrid$cvse).
100/100 mismatches (average diff: 0.484)
[1] 4.14 - 3.69 == 0.449
[2] 4.12 - 3.69 == 0.429
[3] 4.10 - 3.72 == 0.379
[4] 4.08 - 3.74 == 0.340
[5] 4.05 - 3.73 == 0.321
[6] 4.02 - 3.71 == 0.306
[7] 4.00 - 3.69 == 0.304
[8] 3.99 - 3.68 == 0.306
[9] 3.98 - 3.67 == 0.310
...

── Failure (???): Test cross validation:  ──────────────────────────────────────
as.numeric(cvfit.ncv$cvse) not equal to as.numeric(cvfit.adaptive$cvse).
100/100 mismatches (average diff: 0.484)
[1] 4.14 - 3.69 == 0.449
[2] 4.12 - 3.69 == 0.429
[3] 4.10 - 3.72 == 0.379
[4] 4.08 - 3.74 == 0.340
[5] 4.05 - 3.73 == 0.321
[6] 4.02 - 3.71 == 0.306
[7] 4.00 - 3.69 == 0.304
[8] 3.99 - 3.68 == 0.306
[9] 3.98 - 3.67 == 0.310
...

Modifying the tests to include 1,000,000 rows and 1000 columns still results in errors:

── Failure (???): Test against ncvreg for entire path: ─────────────────────────
as.numeric(fit.ncv$beta) not equal to as.numeric(fit.hybrid$beta).
4090/100100 mismatches (average diff: 0.234)
[1004] 0.0390 -  0.1028 == -0.0638
[1019] 0.0507 -  0.1145 == -0.0639
[2005] 0.0881 -  0.2096 == -0.1215
[2020] 0.0998 -  0.2214 == -0.1216
[2039] 0.0000 - -0.0453 ==  0.0453
[3006] 0.1357 -  0.3092 == -0.1735
[3021] 0.1475 -  0.3210 == -0.1735
[3028] 0.0000 - -0.0945 ==  0.0945
[3032] 0.0000 - -0.0108 ==  0.0108
...

── Failure (???): Test against ncvreg for entire path: ─────────────────────────
as.numeric(fit.ncv$beta) not equal to as.numeric(fit.adaptive$beta).
4090/100100 mismatches (average diff: 0.234)
[1004] 0.0390 -  0.1028 == -0.0638
[1019] 0.0507 -  0.1145 == -0.0639
[2005] 0.0881 -  0.2096 == -0.1215
[2020] 0.0998 -  0.2214 == -0.1216
[2039] 0.0000 - -0.0453 ==  0.0453
[3006] 0.1357 -  0.3092 == -0.1735
[3021] 0.1475 -  0.3210 == -0.1735
[3028] 0.0000 - -0.0945 ==  0.0945
[3032] 0.0000 - -0.0108 ==  0.0108
...

Option for sample weights?

I was wondering if there was a way to put in sample weights for biglasso. In cv.glmnet there is a "weights" option and it would be very useful for econometrics work.

predict.cv.biglasso - address 0x2b2da5f69ae0, cause 'memory not mapped'

I was trying to predict.cv.biglasso and got the following error :

I havent been able to figure why this error occurs. Any ideas?

*** caught segfault ***
address 0x2b2da5f69ae0, cause 'memory not mapped'

Traceback:
1: get_eta(X@address, as.integer(row.idx - 1), beta, beta.T@i, beta.T@j)
2: predict.biglasso(object$fit, X = X, row.idx = row.idx, type = type, lambda = lambda, which = which, ...)
3: predict.cv.biglasso(fit1, X, type = "response", lambda = fit1$lambda[fit1$min])
4: predict(fit1, X, type = "response", lambda = fit1$lambda[fit1$min])
5: as.vector(predict(fit1, X, type = "response", lambda = fit1$lambda[fit1$min]))
6: lasso_reg(test_sub, "101_anat_fmri.nii.gz", "afni_biglasso_selection_cv_fs+orig.BRIK.gz", "penalty_bglasso_obj_fs.RData", "fs")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/torque/mom_priv/jobs/11529146.owens-batch.ten.osc.edu.SC: line 43: 7338 Segmentation fault Rscript /users/PAS1316/prateeksasan/hcp/wexner/controls_lasso_reg_oos.R

Benchmarks

It would be very nice if there were a set of benchmarks comparing this implementation to other implementations in R, such as:

As far as I can tell, right now there's only the author's word for the efficiency.

Error R_Reprotect: only x protected items, can't reprotect index y

This might be related to #41 but as the error is different, I'm reporting this as a separate issue. In fact, this was the problem I came across first but when trying to boil everything down to a reproducible example, I ran into what is reported in #41. What seems to trigger this more severe issue is simply a larger problem size when compared to what is reported in #41.

set.seed(11)

nrow <- 3000000
ncol <- 1000

mat <- bigmemory::big.matrix(nrow, ncol, type = "double", shared = TRUE)

for (i in seq_len(ncol)) {
 mat[, i] <- sample(c(0, 1), nrow, replace = TRUE)
}

res <- biglasso::cv.biglasso(
 mat, sample(c(TRUE, FALSE), nrow, replace = TRUE), family = "binomial",
 ncores = 8, nlambda = 50, verbose = TRUE
)

yielding

Preprocessing start: 2021-09-21 13:23:31.000
Preprocessing end: 2021-09-21 13:23:41.000

-----------------------------------------------
Lambda 1. Now time: 2021-09-21 13:23:42.000
...
Lambda 49. Now time: 2021-09-21 13:51:23.000
Error in structure(new.time - time, class = "proc_time") : 
 R_Reprotect: only -69 protected items, can't reprotect index -71
Calls: <Anonymous> -> biglasso -> system.time -> structure
Error during wrapup: R_Reprotect: only -53 protected items, can't reprotect index -54

and my sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cluster/spack/apps/linux-centos7-x86_64/gcc-8.2.0/openblas-0.3.7-aapr6wps44sbmhbedmmigxjk5bhfy5p3/lib/libopenblas_haswell-r0.3.7.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 bigmemory_4.5.36    compiler_4.0.2     
[4] Matrix_1.2-18       biglasso_1.4.1      parallel_4.0.2     
[7] Rcpp_1.0.7          grid_4.0.2          ncvreg_3.13.0      
[10] lattice_0.20-41

Happy to share additional info and also to do some further probing of the issue if someone has any pointers.

Installation error on CRAN

It seems {biglasso} is getting lots of "Installation failed" on CRAN (cf. https://cran.r-project.org/web/checks/check_results_biglasso.html).

And this starts to propagate to my package as well.
It would be great if you could look into it.

Update benchmarks

The last two releases of glmnet have yielded significant speed increases, as legacy Fortran code has been replaced with modern C++ (see their NEWS file). In the interest of objectivity, I think it would be a good idea to update the benchmarks that are highlighted on the README here. AFAICT these benchmarks were last run in 2020.

Regardless, thanks for maintaining this excellent package. I'm a big fan.

Running biglasso is much slower than before

Before, with my CentOS 7.2, it was very fast.
Now, I've upgraded to 7.4. I also tried with Linux Mint and with another computer.

Every run of biglasso (or my function that has a very similar code) is 1.5-10 times slower (sometimes more).

Have you any idea what could cause this huge drop in performance?
A missing library? Some difference in compiler optimizations?

This prevents me from finishing my simulations for a paper, any insight would be MUCH appreciated.

Subsetting a

I think you should always subset a at https://github.com/YaohuiZeng/biglasso/blob/master/R/biglasso.R#L263.

An early stop can be the consequence of a saturated model (only in the logistic model) but also the consequence of the use of dfmax, which can appear in gaussian models too.
It then causes an error at https://github.com/YaohuiZeng/biglasso/blob/master/R/biglasso.R#L275 due to the length incompatibility between the two subtracted vectors.

Error in calling table(y) on numeric variable

I hit an error ("unique() applies only to vectors") when trying to run cv.biglasso on a numeric response.

if (fit$family == "binomial" & (min(table(y)) > nfolds)) is not short circuiting when fit$family is gaussian and thus table(y) is being called regardless of family.

Changing & to && should take care of the error.

Positivity or box constraints on fitted coefficients?

In glmnet it is possible to impose positivity constraints on the fitted coefficients using the lower.limits argument and box constraints using the upper.limits argument. Would it be possible to incorporate this feature too in biglasso? Maybe just by clipping the coefficients to the specified bounds after each iteration or something?

row.idx doesn't work ?

I try to split X,y by row.idx parameter in biglasso and cv.biglasso
however when row.idx is been used, cv.biglasso will return error massage
"錯誤發生在 E[cv.ind == i, 1:res$nl] <- res$loss：
被替換的項目不是替換值長度的倍數"
sorry for chinese version massage,
this massage more likely indicate that two lengths "E[cv.ind == i, 1:res$nl]" and "res$loss" are different

code

library(biglasso)

data("colon")

X <- as.big.matrix(colon$X)
y <- rnorm(62, mean=50)

tr.ind <- sample(length(y), length(y)*0.85, replace = FALSE)
tt.ind <- seq_len(length(y))[-tr.ind]

## With Error massage after cv.biglasso
mdl <- biglasso(X,y,row.idx = tr.ind, penalty = "lasso", family = "gaussian")
cv.mdl <- cv.biglasso(X,y,row.idx = tr.ind, nfolds = 10, penalty = "lasso", family = "gaussian")

## Without Error massage after cv.biglasso
tr.X <- as.big.matrix(colon$X[tr.ind,])
tt.X <- as.big.matrix(colon$X[-tr.ind,])

tr.mdl <- biglasso(tr.X, y[tr.ind], penalty = "lasso", family = "gaussian")
cv.tr.mdl <- cv.biglasso(tr.X,y[tr.ind], nfolds = 10, penalty = "lasso", family = "gaussian")

session info

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=zh_TW.UTF-8       LC_NUMERIC=C               LC_TIME=zh_TW.UTF-8        LC_COLLATE=zh_TW.UTF-8     LC_MONETARY=zh_TW.UTF-8   
 [6] LC_MESSAGES=zh_TW.UTF-8    LC_PAPER=zh_TW.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_TW.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biglasso_1.4.1   ncvreg_3.13.0    Matrix_1.3-4     bigmemory_4.5.36

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 compiler_4.1.1      parallel_4.1.1      tools_4.1.1         Rcpp_1.0.7          grid_4.1.1          lattice_0.20-45

Split available cores in cv.biglasso()

I have a quick question: say you have 40 cores available for a 5-fold cv.biglasso() run. From having a quick look at your code, it seems to me that this would have 35 cores doing nothing. From a distance it seems straightforward, adding a couple of minor changes, to have each fold work with 8 cores (and thereby making such a scenario a bit more efficient). Or am I mistaken? I'm happy to prepare a PR adding such functionality.

Memory not mapped

This package has been quite useful for my research and thank the author for writing it. However, sometimes I experienced the following problem:

In short words, I have a function which calls biglasso, and I was trying to use mclapply function in the package parallel to have multiple cores to run lasso with same X (created by setupX) and different y.

Here is my function, I think procedures other than biglasso are unimportant:

 if (k > 1) {
   y <-sample(realy)
 } else {
   y <-realy
 }
 tryCatch( {
   time <- system.time(  
     fit <- biglasso(X, y, family='gaussian', penalty.factor = penalty, 
                   lambda.min = 0.0001, nlambda = 250)
   )
   print(time)
   # do the "maxfrac" operation
   coef = as.matrix(fit$beta)
   coef[coef < 0] <- 0.0
   coef = round(coef, digits = 6)
   colnames(coef) = fit$lambda
   if (k == 1) {
     write.table(coef, file=paste(outfname, ".coef", sep=""), sep="\t")
   }
   norm_beta = scale(coef, center=FALSE, scale=colSums(coef))
   beta_max = apply(norm_beta, 1, max)
 }, error = function(e) {
   return(rep(-1, dim(X)[2]+1)) } 
 )
 return(beta_max)
}

Then I use beta_max_all <- mclapply(1:(batch_size+1), bottleneck, mc.cores=n_cores, mc.cleanup=TRUE) for multi-core jobs.

I'd be appreciate if the author can help me figure out what causes the problem. I cannot see any rules about when such issue happens; it seem to occur randomly and sometimes the script just runs smoothly. I run my computing on an Sun Grid Engine Cluster.

Here is all the error message.

*** caught segfault ***
address 0x30, cause 'memory not mapped'

Traceback:
 1: biglasso(X, y, family = "gaussian", penalty.factor = penalty,     lambda.min = 1e-04, nlambda = 250)
 2: system.time(fit <- biglasso(X, y, family = "gaussian", penalty.factor = penalty,     lambda.min = 1e-04, nlambda = 250))
 3: doTryCatch(return(expr), name, parentenv, handler)
 4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 5: tryCatchList(expr, classes, parentenv, handlers)
 6: tryCatch({    time <- system.time(fit <- biglasso(X, y, family = "gaussian",         penalty.factor = penalty, lambda.min = 1e-04, nlambda = 250))    print(time)    coef = as.matrix(fit$beta)    coef[coef < 0] <- 0    coef = round(coef, digits = 6)    colnames(coef) = fit$lambda    if (k == 1) {        write.table(coef, file = paste(outfname, ".coef", sep = ""),             sep = "\t")    }    norm_beta = scale(coef, center = FALSE, scale = colSums(coef))    beta_max = apply(norm_beta, 1, max)}, error = function(e) {    return(rep(-1, dim(X)[2] + 1))})
 7: FUN(X[[i]], ...)
 8: lapply(X = S, FUN = FUN, ...)
 9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
13: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
14: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
15: FUN(X[[i]], ...)
16: lapply(seq_len(cores), inner.do)
17: mclapply(1:(batch_size + 1), bottleneck, mc.cores = n_cores,     mc.cleanup = TRUE)
18: system.time(beta_max_all <- mclapply(1:(batch_size + 1), bottleneck,     mc.cores = n_cores, mc.cleanup = TRUE))
An irrecoverable exception occurred. R is aborting now ...