distrom's People

Contributors

mataddy, nelson-n

distrom's Issues

Outdated and slightly unclear documentation

Hi Matt. Thanks for the great packages and papers.

A couple of small issues with the docs of the dmr function.

  • "Then each individual is outsourced to Poisson regression ..." could be read as saying that the matrix is split by rows (individuals) rather than by columns. I was quite confused until I dug into the code itself. It might be clearer to say something like "Then categories are outsourced in uniform chunks to Poisson regression ...".

  • p in "and p is the dimension of x_i" is not defined. The sum ranges over i but has no upper limit.

  • The meaning of the bins argument is unclear. I dug all the way down to the collapse function and still could not figure out what it does until I ran it on some examples while stepping through the dmr function.

    I think it's a very useful argument, and the docs would be better if they clearly stated what happens at the lower level and what the intention is. Maybe something along the lines of "bins is the number of quantile cuts for each covariate and is intended for data discretization. The discretized covariates are then interacted to form groups within which the values of the original covariates are averaged and counts are summed.". I think that no matter how good the documentation of bins is, it won't be clear unless a comprehensive example is provided. Note that the collapse function doesn't contain any examples.

  • References to your papers are outdated. For instance, "Taddy (2013) The Gamma Lasso" should probably be: Taddy, Matt. 2013. ‘One-Step Estimator Paths for Concave Regularization’. arXiv:1308.5623 [stat], August. http://arxiv.org/abs/1308.5623. And the distributed paper: Taddy, Matt. 2015. ‘Distributed Multinomial Regression’. The Annals of Applied Statistics 9 (3): 1394–1414.
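
To make the suggested wording for bins concrete, here is a minimal base-R sketch of the behaviour described in the third bullet above: quantile cuts per covariate, interaction of the discretized covariates into groups, then averaging covariates and summing counts within each group. It only illustrates that description and is not distrom's collapse implementation; the data, the helper cut_at_quantiles, and all variable names are made up for the example.

```r
# Hypothetical toy data: 8 observations, 2 covariates, 2 count columns.
x <- data.frame(age    = c(21, 25, 33, 38, 47, 52, 60, 68),
                income = c(10, 40, 15, 45, 12, 50, 20, 55))
counts <- cbind(w1 = c(1, 0, 2, 1, 0, 3, 1, 0),
                w2 = c(0, 1, 0, 2, 1, 0, 0, 4))
bins <- 2  # number of quantile bins per covariate

# Discretize one covariate at its empirical quantiles (hypothetical helper).
cut_at_quantiles <- function(v, bins) {
  breaks <- unique(quantile(v, probs = seq(0, 1, length.out = bins + 1)))
  cut(v, breaks = breaks, include.lowest = TRUE)
}

# Interact the discretized covariates to form groups.
binned <- lapply(x, cut_at_quantiles, bins = bins)
group  <- interaction(binned, drop = TRUE)

# Within each group: average the original covariates, sum the counts.
x_collapsed      <- aggregate(x, by = list(group = group), FUN = mean)
counts_collapsed <- rowsum(counts, group)

print(x_collapsed)
print(counts_collapsed)
```

With two bins on each of two covariates, the eight rows collapse into at most four groups, and the total of the counts is unchanged by the collapse.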

Changes to as(object, Class) in Matrix 1.4-2

Documenting this potential issue for future reference.

The next version of the Matrix package (1.4-2) will formally deprecate 187 coercion methods. More precisely, coercions of the form: as(object, Class).

Given that distrom coerces matrices to dgC format (see line 90 of dmr.R: as(Bs,"dgCMatrix")), this would normally be an issue that requires a change to the code. However, Matrix maintainer Martin Maechler has stated:

"In the mean time we have decided to keep as(<traditional_R_matrix>, "dgCMatrix") non-deprecated as it is used in so many places."

Because distrom only coerces to dgCMatrix and not other types, I believe that the changes in Matrix 1.4-2 will not affect the distrom package. If it turns out to be a problem, I will revisit this issue.

fix nobs after collapse

Right now nobs is sum(nbin), where nbin is the tabulated counts per bin. Investigate whether it should instead be the simple length(nbin) for BIC calculations.
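
For a sense of scale, the two candidate definitions give very different BIC penalties. The sketch below uses made-up bin counts, a made-up number of degrees of freedom, and the textbook penalty term log(n) * df. Which definition is appropriate after collapsing is exactly the open question, and this example does not settle it.

```r
# Hypothetical tabulated counts per bin after collapsing.
nbin <- c(120, 45, 300, 35)

n_total <- sum(nbin)     # number of original (uncollapsed) observations
n_bins  <- length(nbin)  # number of collapsed rows

df <- 3  # hypothetical degrees of freedom of a fitted model

# Textbook BIC penalty term: log(n) * df.
penalty_total <- log(n_total) * df
penalty_bins  <- log(n_bins) * df

cat("nobs = sum(nbin):   ", n_total, " penalty:", penalty_total, "\n")
cat("nobs = length(nbin):", n_bins,  " penalty:", penalty_bins,  "\n")
```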

Issue applying Multinomial Logistic Regression on Congress109 dataset

Hi TaddyLab,
I'm completely new to GitHub, so I'm not sure whether this is the right place to post my issue. However, after intensive research I decided to go this way: I am currently trying to apply mnlm to the Congress109 dataset and get the following error from an if statement in the function.
"Error in if (C < p/4) { : argument is of length zero"
Do you have any suggestions for how to resolve this problem?

Kind regards,
Max
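
For context, this is the generic way R produces an "argument is of length zero" error: the condition handed to if() evaluates to a zero-length value, typically because one operand is NULL. The sketch below reproduces the mechanism with a made-up input. It assumes, without confirmation from the post, that a dimension lookup returned NULL; the actual cause in the mnlm call may well be different.

```r
# A plain vector has no dim attribute, so ncol() returns NULL.
counts <- c(3, 0, 1)   # hypothetical input that is not a matrix
p <- ncol(counts)      # NULL

# Comparing NULL with a number yields logical(0), which if() rejects.
msg <- tryCatch(
  if (p < 4) "few columns" else "many columns",
  error = function(e) conditionMessage(e)
)
print(msg)

# Coercing to a matrix first gives ncol() a dimension to report.
counts_mat <- as.matrix(counts)
print(ncol(counts_mat))
```

Checking dim() on the covariate and count inputs before calling the function is a quick way to rule this mechanism in or out.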

Memory error on collapse step

I'm running DMR on a sparse "v" matrix (1M rows x 13k columns). I get an out-of-memory error during the "collapse" step (I am specifying bins=2 because "v" is a matrix of 0/1 values). It looks like the error occurs in this step:

B <- apply(v,2,cutit)

I'm not 100% sure, but it could be that this statement in the "apply" function casts the sparse matrix to a dense matrix. Here is the top bit of "apply":

apply
function (X, MARGIN, FUN, ...)
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl)
        stop("dim(X) must have a positive length")
    if (is.object(X))
        X <- if (dl == 2L)
            as.matrix(X)
        else as.array(X)
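
A back-of-the-envelope calculation shows why a dense copy of "v" would not fit in memory. It assumes the dimensions quoted above are roughly exact and that the dense copy stores every entry as an 8-byte double; both are assumptions for the sake of the estimate.

```r
# Approximate dimensions from the report (assumed exact here).
n_rows <- 1e6
n_cols <- 13e3

bytes_per_double <- 8

# Size of a fully dense numeric matrix with these dimensions.
dense_bytes <- n_rows * n_cols * bytes_per_double
dense_gb    <- dense_bytes / 1e9

cat("Dense copy would need about", dense_gb, "GB\n")
```

A sparse representation only stores the nonzero entries, so any step that densifies the whole matrix at once is a likely place for the memory to run out.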

Indexing error '[[' in coef.dmr

Thank you for the awesome package.

In the coef method of the dmr class, counts (columns) that have no observations (all zeroes), and thus NULL Poisson regression coefficients, are assigned the coefficient 0. This is done in the following line:

Line 81 of dmr.R:
failures <- sapply(B,is.null)
if(any(failures)) B[[which(failures)]] <- Matrix(0)

Because [[ can only index a single element and not a vector of elements, this line results in an error if the counts matrix has multiple columns with no observations and thus multiple features that are NULL. The temporary solution that I have been using is to replace [[ with [ in line 82, which allows list indexing with a vector of elements. I do not love this solution because it throws the warning message: implicit list embedding of S4 objects is deprecated. I am sure that there is an elegant solution, but as it stands distrom:::coef.dmr and, by extension, textir::srproj do not work if multiple columns in the matrix have no observations (all zeroes).
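
One possible way to fill several NULL entries at once is to index with [ and wrap the replacement in list(), so the right-hand side is already a list. The sketch below is base R only and uses a plain matrix(0) as a stand-in for Matrix(0); the list B and its names are made up. Since the value is wrapped explicitly, this form should also sidestep the implicit-embedding warning with S4 objects, but that has not been checked against distrom itself.

```r
# Hypothetical stand-in for the per-column coefficient list,
# with two columns whose fits came back NULL.
B <- list(w1 = matrix(0.5), w2 = NULL, w3 = matrix(-1.2), w4 = NULL)

# Stand-in for Matrix(0); base matrix used to keep the sketch dependency-free.
zero <- matrix(0)

failures <- sapply(B, is.null)

# `[` accepts a logical vector, and list(zero) is recycled to every
# selected position, so any number of NULL entries is handled.
if (any(failures)) B[failures] <- list(zero)

str(B)
```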

new error in dmr cv option

DMR works fine if I don't use the CV option, but if I turn it on, I get an error every time. I can't imagine what the issue is: the exact same code (with cv=TRUE) was working fine a week ago, though I have updated the package since then. Anyway, here's the console output that shows the error.

dim(X)
[1] 1639 547
dim(Y)
[1] 1639 3
cl <- makeCluster(detectCores(), type="FORK")
DM<-dmr(cl=cl, covars=X, counts=Y, cv=F)#, nfold=10)#, select="min")
summary(DM)
Length Class Mode
SUPG5 11 gamlr list
SUPG6 11 gamlr list
SUPG7 11 gamlr list
DM<-dmr(cl=cl, covars=X, counts=Y, cv=T)#, nfold=10)#, select="min")
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: error in evaluating the argument 'x' in selecting a method for function 't': Error: length(eta) == n is not TRUE

UPDATE: I went back through the GitHub history and the March 16th version works just fine. I'll keep using that until I hear otherwise. Here's the console output:

source("dmrOLD.R")
X<-sets[["ALL"]][REAL,]
Y<-MM[REAL,(colMeans(MM[REAL,])>0)]
cl <- makeCluster(detectCores(), type="FORK")
DM<-dmr(cl=cl, covars=X, counts=Y, cv=T, nfold=10, select="min")
stopCluster(cl)
summary(DM)
Length Class Mode
SUPG5 10 cv.gamlr list
SUPG6 10 cv.gamlr list
SUPG7 10 cv.gamlr list

Warning: hit max CD iterations

I'm not sure if this is an actual issue or just something that should be expected with the gamma lasso regression, but I keep running into the "Warning: hit max CD iterations" on random words. I've got a sample of 5,000 text documents that I have been using to try and track down the problem. The vocabulary is 16,350 words and there are 100 coefficients (mostly dummy variables) for the "covars" argument. I only encounter the issue with 29 of the 16,350 words and the only pattern I can see is that the error always pops up on the second segment of the lambda path (I'm using nlambda = 100). I'm happy to share the sample data if this seems like an error and replicating it would be helpful.

Thanks for a great package! Also, I recently picked up a copy of "Business Data Science" and it is really fantastic! I wish I had read it as a PhD student.
