zachmayer / caretensemble

caret models all the way down :turtle:

Home Page: http://zachmayer.github.io/caretEnsemble/

License: Other

R 96.52% Makefile 3.48%

caretensemble's Introduction

caretEnsemble

caretEnsemble is a framework for stacking models fit with the caret package.

Use caretList to fit multiple models, and then use caretStack to stack them with another caret model.

First, use caretList to fit many models to the same data:

set.seed(42L)
data(diamonds, package = "ggplot2")
dat <- data.table::data.table(diamonds)
dat <- dat[sample.int(nrow(diamonds), 500L), ]
models <- caretEnsemble::caretList(
  price ~ .,
  data = dat,
  methodList = c("rf", "glmnet")
)
print(summary(models))
#> The following models were ensembled: rf, glmnet  
#> 
#> Model accuracy:
#>    model_name metric    value       sd
#>        <char> <char>    <num>    <num>
#> 1:         rf   RMSE 1076.492 215.4737
#> 2:     glmnet   RMSE 1142.082 105.6022

Then, use caretEnsemble to make a greedy ensemble of these models:

greedy_stack <- caretEnsemble::caretEnsemble(models)
print(greedy_stack)
#> The following models were ensembled: rf, glmnet  
#> 
#> caret::train model:
#> Greedy Mean Squared Error Optimizer 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 400, 400, 400, 400, 400 
#> Resampling results:
#> 
#>   RMSE      Rsquared   MAE     
#>   969.2517  0.9406218  557.1987
#> 
#> Tuning parameter 'max_iter' was held constant at a value of 100
#> 
#> Final model:
#> Greedy MSE
#> RMSE:  989.2085 
#> Weights:
#>        [,1]
#> rf     0.55
#> glmnet 0.45
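
As a rough illustration of how a greedy MSE optimizer can arrive at weights like those above, the sketch below implements forward selection with replacement in base R, in the spirit of the Caruana et al. ensemble selection paper. It is an assumption about the general technique, not caretEnsemble's actual implementation; `greedy_mse_weights` and the simulated data are hypothetical.

```r
# Hypothetical helper: greedy forward selection with replacement.
# preds: numeric matrix, one column of predictions per model
# y:     numeric vector of observed values
greedy_mse_weights <- function(preds, y, max_iter = 100L) {
  n_models <- ncol(preds)
  counts <- integer(n_models)          # how many times each model was picked
  running_sum <- numeric(nrow(preds))  # sum of the predictions picked so far
  for (iter in seq_len(max_iter)) {
    # MSE of the averaged prediction if each candidate were added next
    mse <- vapply(seq_len(n_models), function(j) {
      mean((y - (running_sum + preds[, j]) / iter)^2)
    }, numeric(1))
    best <- which.min(mse)
    counts[best] <- counts[best] + 1L
    running_sum <- running_sum + preds[, best]
  }
  weights <- counts / sum(counts)      # non-negative weights that sum to 1
  names(weights) <- colnames(preds)
  weights
}

# Simulated predictions from two noisy "models" (illustrative data only)
set.seed(1)
y <- rnorm(200)
preds <- cbind(a = y + rnorm(200, sd = 0.5), b = y + rnorm(200, sd = 0.5))
w <- greedy_mse_weights(preds, y)
ensemble_pred <- as.numeric(preds %*% w)
```

Because each iteration adds exactly one model, the weights are simple selection frequencies, which is why they come out non-negative and sum to one.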

You can also use caretStack to make a non-linear ensemble:

rf_stack <- caretEnsemble::caretStack(models, method = "rf")
#> note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
print(rf_stack)
#> The following models were ensembled: rf, glmnet  
#> 
#> caret::train model:
#> Random Forest 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 400, 400, 400, 400, 400 
#> Resampling results:
#> 
#>   RMSE      Rsquared  MAE     
#>   1081.425  0.930012  540.3294
#> 
#> Tuning parameter 'mtry' was held constant at a value of 2
#> 
#> Final model:
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: regression
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>           Mean of squared residuals: 925377
#>                     % Var explained: 93.95

Use autoplot from ggplot2 to plot ensemble diagnostics:

ggplot2::autoplot(greedy_stack, training_data = dat, xvars = c("carat", "table"))

ggplot2::autoplot(rf_stack, training_data = dat, xvars = c("carat", "table"))

Installation

Install the stable version from CRAN:

install.packages("caretEnsemble")

Install the dev version from github:

devtools::install_github("zachmayer/caretEnsemble")

There are also tagged versions of caretEnsemble on github you can install via devtools. For example, to install the previous release of caretEnsemble use:

devtools::install_github("zachmayer/caretEnsemble@<tag>")

This is useful if the latest release breaks some aspect of your workflow. caretEnsemble is pure R with no compilation, so this command will work in a variety of environments.

Package development

This package uses a Makefile. Use make help to see the supported options.

Use make fix-style to fix simple linting errors.

For iterating while writing code, run make dev. This runs just make clean fix-style document lint spell test, for a quicker local dev loop. Please still run make all before making a PR.

Use make all before making a pull request, which will also run R CMD CHECK and a code coverage check. This runs make clean fix-style document install build-readme build-vignettes lint spell test check coverage preview-site.

First time dev setup:

Run make install from the git repository to install the dev version of caretEnsemble, along with the necessary package dependencies.

Inspiration and similar packages:

caretEnsemble was inspired by medley, which in turn was inspired by Caruana et al.'s (2004) paper Ensemble Selection from Libraries of Models.

If you want to do something similar in python, check out vecstack.

Code of Conduct:

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


caretensemble's Issues

Allow trace = FALSE in buildModels

Currently, when training a nnet or a multinom or other forms of neural networks, the R console is filled with information about the individual iterations. Passing trace = FALSE to train when fitting a nnet suppresses this console clutter.

However, currently buildModels will not properly pass trace = FALSE to the train argument. This should be fixed!

Issue with caretEnsemble

I am using the following code, but it results in Error: length(unique(indexes)) == 1 is not TRUE. As per a previous post I have included the seeding part, but I still get the error. Please help. I am using R 3.03.

library(mlbench)
data(Sonar)
library(caret)
set.seed(123)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(1000, 3)

# For the last model:

seeds[[11]] <- sample.int(1000, 1)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]
fitControl <- trainControl(method = "cv", number = 10,repeats = 1,classProbs=TRUE, savePredictions=TRUE, seeds=seeds)

PP <- c('center', 'scale')
set.seed(1)

gbmFit1 <- train(Class ~ ., data = training,method = "svmRadial",trControl = fitControl,verbose = FALSE)

gbmFit2 <- train(Class ~ ., data = training,method = "gbm",trControl = fitControl,verbose = FALSE)
all.models <- list( gbmFit1 , gbmFit2 )

greedy <- caretEnsemble(all.models, iter=1000L)

sort(greedy$weights, decreasing=TRUE)

greedy$error

Subscript Out of Bounds

Thanks for the code. I am really enjoying using it. I am seeing an issue pop-up on only a subset of the data sets that I am training. Any ideas what I might look into to address this?

Error in foo(betas[[length(betas)]], newdata)[, 1] : 
  subscript out of bounds

The traceback looks like this.

11: predictionFunction(method, modelFit, tempX, custom = models[[i]]$control$custom$prediction)
10: extractPrediction(list(object), unkX = newdata, unkOnly = TRUE, 
    ...)
9: predict.train(x, type = "raw", newdata = newdata, ...)
8: predict(x, type = "raw", newdata = newdata, ...) at helper_functions.R#137
7: FUN(X[[i]], ...)
6: pblapply(X, FUN, ...)
5: pbsapply(list_of_models, function(x) {
   if (type == "Classification" & x$control$classProbs) {
       predict(x, type = "prob", newdata = newdata, ...)[, 2]
   }
   else {
       predict(x, type = "raw", newdata = newdata, ...)
   }
   }) at helper_functions.R#133
4: multiPredict(ensemble$models, type, ...) at caretEnsemble.R#75
3: predict.caretEnsemble(fit, newdata = data.x)
2: predict(fit, newdata = data.x) at deposits-models.R#130
1: trainThenPredict(by, data, data.id)

installing caretEnsemble without net connection

Hi Zach,
caretEnsemble is proving to be very useful for my work and it is working fine. However, I would like to install caretEnsemble on a high-end machine that is not connected to the internet. So my query is:

If I do not have an internet connection on my machine (but do on a different machine), is there a standalone version (tarball) of caretEnsemble that can be installed without internet access? And how do I install it (as for other R packages, is it R CMD INSTALL package.tar.gz)?
Thanks

safeOptRMSE

Similar to safeOptAuc, stops ensembling if accuracy decreases. Consider making the safe functions the defaults, and renaming the old functions fastOpt*.

Then we could have a flag to caretEnsemble for safe=TRUE. If TRUE, we use the safe functions, if FALSE we use the slightly faster non-safe functions.

If there's not a big difference between the safe vs non-safe speed, we get rid of the non-safe ones.
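
To make the "stop if accuracy decreases" idea concrete, here is a minimal base-R sketch of a greedy RMSE optimizer with early stopping. It is only one possible reading of the proposal; `safe_greedy_weights` is a hypothetical name, not an existing caretEnsemble function.

```r
# Hypothetical sketch: greedy selection that stops once no candidate improves MSE.
safe_greedy_weights <- function(preds, y, max_iter = 100L) {
  counts <- integer(ncol(preds))
  running_sum <- numeric(nrow(preds))
  best_mse <- Inf
  for (iter in seq_len(max_iter)) {
    # MSE of the averaged prediction if each candidate were added next
    mse <- apply(preds, 2, function(p) mean((y - (running_sum + p) / iter)^2))
    if (min(mse) >= best_mse) break  # no candidate improves the ensemble: stop
    best <- which.min(mse)
    best_mse <- mse[[best]]
    counts[best] <- counts[best] + 1L
    running_sum <- running_sum + preds[, best]
  }
  stats::setNames(counts / sum(counts), colnames(preds))
}

# Toy data: one perfect model and one biased model
y <- c(1, 2, 3, 4)
preds <- cbind(good = y, bad = y + 5)
w <- safe_greedy_weights(preds, y)
```

In this toy case the loop stops after a single pick, since adding anything further cannot lower the error.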

n.minobsinnode

Inclusion of n.minobsinnode in GBM model gives an error:
model1 <- train(X, Y, method='gbm', trControl=myControl,tuneGrid=expand.grid(.n.trees=100, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode = 10))

My query is, how to pass n.minobsinnode in model1.

Thanks

Make a clean.caretList function

Goes through the models and removes data that isn't needed, e.g. resamples, out-of-sample predictions, the original dataset, and anything else we can think of.

After making a caretEnsemble or caretStack, we'd run clean.caretList to shrink the final models as much as possible.
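
A minimal sketch of what such a cleaning step could look like, using plain lists as stand-ins for fitted models. The function name `clean_model_list` and the default component names are assumptions for illustration; which components are safe to drop would need to be checked against real train objects.

```r
# Hypothetical sketch: drop bulky components from each model in a list.
# Intended to run after the ensemble is built, since stacking needs the
# out-of-sample predictions.
clean_model_list <- function(models, drop = c("trainingData", "resample", "pred")) {
  lapply(models, function(m) {
    m[intersect(names(m), drop)] <- NULL  # remove only the components present
    m
  })
}

# Mock "models": plain lists standing in for fitted train objects
mock <- list(
  rf = list(finalModel = "fit", trainingData = data.frame(x = runif(1e4)), pred = 1:10),
  glmnet = list(finalModel = "fit", resample = data.frame(RMSE = runif(5)))
)
cleaned <- clean_model_list(mock)
print(object.size(mock), units = "Kb")
print(object.size(cleaned), units = "Kb")
```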

Installation

Zachary,

I am trying to install caretEnsemble on Windows 7.0. Please tell me the exact procedure for installing on Windows. I have downloaded caretEnsemble-master.zip, but after installing it from the zip file it does not appear in the library list, so I can't load the library. The same happens with the caretEnsemble-Dev version. I would appreciate a quick reply on this.

Regards
Amit

AUC by caretensemble

While using caretEnsemble we come across two AUCs. One is obtained while computing the error, using
error <- colAUC(as.matrix(temp) %*% weights, temp3) (here as.matrix(temp) %*% weights gives us the combined predictions of all the models)

and the other comes from the predicted probabilities obtained using
return(list(preds = data.frame(pred = est, se = se.tmp), weight = conf))
which are then used to obtain the ROC, AUC and accuracy.
The two AUCs differ. Which is the correct one? Which predictions should be used to obtain the accuracy?

Regards
Amit

Add parallel option to predict.caretStack using foreach

Ensembling decreases accuracy

I am applying caretEnsemble on a large dataset and have noticed the curious case where ensembling sometimes produces an ensembled prediction with an AUC that is lower than the AUC of any of the ensembled models.

I need to check that a) this is not a bug in the print or summary methods to caretEnsemble and b) that we issue a warning when this is the case so that the user is notified.
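
The warning described in b) could be sketched roughly as follows. `warn_if_ensemble_worse` is a hypothetical helper, and the metric values in the usage lines are made up for illustration.

```r
# Hypothetical check: warn when the ensemble metric is worse than every component model.
# maximize = TRUE for metrics like AUC, FALSE for metrics like RMSE.
warn_if_ensemble_worse <- function(ensemble_metric, model_metrics, maximize = TRUE) {
  worse <- if (maximize) {
    ensemble_metric < min(model_metrics)
  } else {
    ensemble_metric > max(model_metrics)
  }
  if (worse) {
    warning("Ensemble metric (", ensemble_metric, ") is worse than all component models.")
  }
  invisible(worse)
}

# No warning: the ensemble AUC beats both component models
ok <- warn_if_ensemble_worse(0.95, c(knn = 0.85, glm = 0.90))
# A call such as warn_if_ensemble_worse(0.80, c(knn = 0.85, glm = 0.90)) would warn.
```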

Installation failed due to error in Rd file

On Mac OS, doing a fresh install:

installing source package 'caretEnsemble' ...
** R
** data
** inst
** tests
** preparing package for lazy loading
** help
Error : /private/var/folders/jk/21t5rvgd3hgbgvmxlxx26q8w0000gn/T/Rtmpatuz5C/devtools97a269953002/caretEnsemble-master/man/makePredObsMatrix.Rd: Sections \title, and \name must exist and be unique in Rd files
ERROR: installing Rd objects failed for package 'caretEnsemble'
removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/caretEnsemble'
Error: Command failed (1)

Add parallel option to predict.caretEnsemble using foreach

trim.caretList, trim.caretEnsemble, and trim.caretStack methods

I'd like to see us add a function that prunes the train models stored in the caretEnsemble object. There is a lot of stuff stored in these individual train objects that takes up a lot of space and is not necessary for doing predictions or diagnostics. This is especially a problem when training models with a large number of observations (50k+).

This is an example of how much can be saved from a glm object and an approach we might want to take here: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r

examples for major functions

I think CRAN likes having examples

caretEnsemble returns NaN for metric

When I run the following example:

library(caretEnsemble)
devtools::install_github("jknowles/EWStools") # to get the data
library(EWStools)
data(EWStestData)


ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 3, classProbs = TRUE, savePredictions = TRUE,
                     summaryFunction = twoClassSummary)


out <- buildModels(methodList = c("knn", "glm", "nb", "lda", "ctree"), 
                   control = ctrl, x = modeldat$traindata$preds, 
                   y = modeldat$traindata$class, tuneLength = 5)

# ensemble we will

out.ens <- caretEnsemble(out)

The model ensembles correctly and the object can produce predictions, but it is curious what summary reports:

> summary(out.ens)
The following models were ensembled: knn, glm, nb, ctree 
They were weighted: 
0.03 0.51 0.45 0.01
The resulting AUC is: 0.9857
The fit for each individual model on the AUC is: 
  method    metric    metricSD
1    knn 0.9402030 0.023904627
2    glm 0.9863390 0.008624078
3     nb       NaN          NA
4  ctree 0.8962804 0.036620901

It looks like the extractModRes function has an issue in how it selects the error parameters for the individual models.

Vignette

We should make a short package vignette showing the difference between non-ensembled train objects and caretEnsemble objects and briefly discussing the advantages of ensembles.

Maybe not for version 1.0, but 1.1 or so.

investigate how caret sets seeds for parallel models

It's possible we don't need the setSeeds. I think #50 will fix the incorrect resampling indexes error when we go to ensemble the models.

I think seeds only need to be set when models are fit in parallel, because if you start 10 R processes (one for each of 10 CV folds) they will tend to have the same random seed, and therefore all fit the same model with no randomness.

I think pre-setting the re-sampling indexes will make the samples used for each model truly random, and I think the default caret logic will make sure different seeds are used for different models, fit in parallel.

Cleanup API and remove unnecessary @export tags

I'd like to hide as much code as possible (other than caretList, caretEnsemble, and caretStack of course!) from the 1.0 release. I don't want people to start using internal functions and then get mad when we refactor or replace them.

Add parallel option to caretList and predict.caretList using foreach

In some cases this might be more desirable than just letting caret do the parallelization— e.g. we're fitting a bunch of different models, each of which has 1 set of tuning parameters.

Plots vignette?

The tests make some cool plots... I'd love to see a vignette that shows off some of the plot.caretEnsemble and plot.caretList plots.

Substituting index=createMultiFolds by index= createTimeSlices produces an error when running the models

Thanks Zach for the latest updates

I would like to replace the following index in my control:
index=createMultiFolds(Y[train], k=folds, times=repeats)

myControl = trainControl(method = "cv", number = folds, repeats = repeats, savePrediction = TRUE, classProbs = FALSE, returnResamp = "final",returnData = TRUE, allowParallel=TRUE, seeds = mseeds, index=createMultiFolds(Y[train], k=folds, times=repeats) )

with:
index= createTimeSlices(Y[train], initialWindow= length(Y[train]) - sum(train==FALSE), horizon = 1, fixedWindow =FALSE)

myControl = trainControl(method = "cv", number = folds, repeats = repeats, savePrediction = TRUE, classProbs = FALSE, returnResamp = "final",returnData = TRUE, allowParallel=TRUE, seeds = mseeds, index= createTimeSlices(Y[train], initialWindow= length(Y[train]) - sum(train==FALSE), horizon = 1, fixedWindow =FALSE))

This works fine; however, when I run a model with this control I get the following error:
model2 <- train(X[train,], Y[train], method='bagEarth',trControl=myControl,preProcess=PP)
Error in -unique(training) :
invalid argument to unary operator

Is there a way to apply caret's createTimeSlices to the ensemble? Maybe I am missing something.
Any help welcomed
Thank you
Barnaby

update.caretEnsemble to run more iterations

If the user decides the model needs some more iterations.

Maybe one of the graphs should be a "convergence" graph that shows the error decreasing over time? And if the error looks like it's still going down, the user could call update to run more iterations?

summary/print/plot methods for caretStack

similar to caretEnsemble. Pass plot to plot.train on the ensemble caret model.

Add verbose argument to caretList

As each model is fit, print the name of the model.

Add checks to extractBestPreds

Todo in the code:

# Insert checks here: observeds are all equal, row indexes are equal, Resamples are equal

Probably a matter of writing 3 functions:
checkObserveds
checkRowIndexes
checkResamples

Each one should give an explanatory error message of what's wrong, which model(s) are the culprit, and why we can't make an ensemble in this situation.
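
A minimal sketch of how the three checks might share one implementation. It assumes each model's saved predictions are available as a data.frame with columns obs, rowIndex and Resample; that layout, and the helper `check_column_matches`, are assumptions for illustration rather than the package's internals.

```r
# Hypothetical sketch: pred_list is a named list with one data.frame per model,
# each with columns obs, rowIndex and Resample (assumed layout).
check_column_matches <- function(pred_list, column) {
  reference <- pred_list[[1]][[column]]
  matches <- vapply(
    pred_list,
    function(p) identical(p[[column]], reference),
    logical(1)
  )
  bad <- names(pred_list)[!matches]
  if (length(bad) > 0) {
    stop(
      "Cannot ensemble: column '", column, "' in model(s) ",
      paste(bad, collapse = ", "),
      " differs from model '", names(pred_list)[1], "'.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

checkObserveds  <- function(pred_list) check_column_matches(pred_list, "obs")
checkRowIndexes <- function(pred_list) check_column_matches(pred_list, "rowIndex")
checkResamples  <- function(pred_list) check_column_matches(pred_list, "Resample")

# Two models with matching predictions pass all three checks
good <- data.frame(obs = c(1, 2, 3), rowIndex = 1:3, Resample = "Fold1")
preds <- list(rf = good, glmnet = good)
checkObserveds(preds)
checkRowIndexes(preds)
checkResamples(preds)
```

Comparing every model against the first keeps the error message specific about which models are the culprits.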

This is pulled out of #3, which kind of grew into many separate bug reports.

Document undocumented data sets to pass CMD CHECK

The datasets used in the test-suite are undocumented. To pass the R CMD CHECK they need to either a) be documented, or b) be used only for testing and stored appropriately like they do in the lme4 source (https://github.com/lme4/lme4/tree/master/inst/testdata).

Multi-class classification greedy optimization

I see that the Dev branch has made some more progress on multi-class classification ensemble stacking, but unfortunately it is not yet done.
Do you plan on implementing this, and/or could you point me in the right direction so I might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended).

Optionally allow buildModels to continue on model failure

If both methodList and tuneList are specified, use the union of both sets

all methodList NOT IN tuneList get added with arguments = NULL
all methods IN tuneList get added with the arguments given in tuneList

So if you specify:
methodList='rf'
tuneList=list(rf=list(tuneLength=10), rpart=list(tuneLength=5))

3 models are fit:
rf with default train arguments
rf with tuneLength 10
rpart with tuneLength 5

This is post 1.0
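
Following the worked example above, where rf ends up being fit twice, the merge could be sketched in base R as below. `merge_model_specs` and the spec layout (a `method` plus an `args` list) are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical sketch: build one specification per model to fit.
merge_model_specs <- function(methodList = NULL, tuneList = NULL) {
  # Methods named in methodList are fit with default train arguments
  from_methods <- lapply(methodList, function(m) list(method = m, args = NULL))
  # Methods named in tuneList carry their own arguments
  from_tunes <- Map(
    function(m, a) list(method = m, args = a),
    names(tuneList), tuneList
  )
  unname(c(from_methods, from_tunes))
}

specs <- merge_model_specs(
  methodList = "rf",
  tuneList = list(rf = list(tuneLength = 10), rpart = list(tuneLength = 5))
)
length(specs)
vapply(specs, function(s) s$method, character(1))
```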

no github under R3.1.1

while trying to install CaretEnsemble, I get the message

install.packages('github')
Warning message:
package ‘github’ is not available (for R version 3.1.1)

What is the way out to install CaretEnsemble on linux ( fedora 17).
Here is the sessionInfo()

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] caret_6.0-35 ggplot2_1.0.0 lattice_0.20-29 devtools_1.5

loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20
[4] codetools_0.2-8 colorspace_1.2-4 digest_0.6.4
[7] evaluate_0.5.5 foreach_1.4.2 grid_3.1.1
[10] gtable_0.1.2 gtools_3.4.1 httr_0.4
[13] iterators_1.0.7 lme4_1.1-7 MASS_7.3-33
[16] Matrix_1.1-4 memoise_0.2.1 minqa_1.2.3
[19] munsell_0.4.2 nlme_3.1-117 nloptr_1.0.4
[22] nnet_7.3-8 parallel_3.1.1 plyr_1.8.1
[25] proto_0.3-10 Rcpp_0.11.2 RCurl_1.95-4.3
[28] reshape2_1.4 scales_0.2.4 splines_3.1.1
[31] stringr_0.6.2 tools_3.1.1 whisker_0.3-2

Thanks

combine predict methods for caretEnsemble and caretStack?

The base prediction is the same: make predictions from each model in the ensemble. The only difference is in one case we weight them and in the other case we run them through another caret model.

Maybe caretList needs its own class, and there should be a predict.caretList function, which returns a matrix. Both predict.caretEnsemble and predict.caretStack could call predict.caretList.

caretList would obviously return a caretList, and we'd need another helper function to turn arbitrary lists of caretModels into caretList. The models slots of caretEnsemble and caretStack would both be of class caretList.
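
The shared structure can be illustrated with base-R lm fits standing in for caret models: one helper builds the prediction matrix, and the two ensemble types differ only in what they do with it. `predict_model_list` is a hypothetical name, and the in-sample stacking here is for illustration only (a real stack would be trained on out-of-fold predictions).

```r
# Hypothetical sketch using base-R lm fits in place of caret models.
predict_model_list <- function(models, newdata) {
  # One column of predictions per model, named after the list elements
  vapply(
    models,
    function(m) as.numeric(predict(m, newdata = newdata)),
    numeric(nrow(newdata))
  )
}

models <- list(
  linear = lm(mpg ~ wt, data = mtcars),
  quadratic = lm(mpg ~ wt + I(wt^2), data = mtcars)
)
pred_matrix <- predict_model_list(models, mtcars)

# Weighted ensemble: combine the columns with fixed weights
weighted <- as.numeric(pred_matrix %*% c(0.5, 0.5))

# Stacked ensemble: feed the columns to another model
stack_data <- data.frame(pred_matrix, mpg = mtcars$mpg)
stacker <- lm(mpg ~ linear + quadratic, data = stack_data)
stacked <- as.numeric(predict(stacker, newdata = stack_data))
```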

Consider making caret model list a named object

Building models with identical resampling profiles may be unclear to users. Allowing the user to specify the methods to build train objects from and then the control parameters, and generating the model list as a named class, e.g. a caretList object would make the ensemble approach much more approachable to users. It could look something like:

methodList <- c("rf", "nnet", "knn", "lm")
control <- trainControl(...)

buildModels <- function(methodList, control, x, y, ...) {
  # TODO: build seeds using control parameters
  # Pre-allocate a list named after the methods
  modelList <- vector("list", length(methodList))
  names(modelList) <- methodList
  for (i in methodList) {
    modelList[[i]] <- caret::train(x = x, y = y, method = i, trControl = control, ...)
  }
  modelList
}

Not sure if this is 'out of scope' for this package, but I think it would make the ensembling process more accessible.

For RMSE, use constrained least squares for ensemble

After computing your weights with greedOptRMSE, you are normalizing them to sum to one, so you wind up getting a convex combination. You could optimize the same loss function with constraints directly by formulating the problem as a quadratic program.

This should work, but is untested:

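qpOptRMSE <- function(x, y) {
  require(quadprog)
  D <- crossprod(x)       # quadratic term: t(x) %*% x
  d <- crossprod(x, y)    # linear term: t(x) %*% y
  # First column: weights sum to 1 (equality); remaining columns: weights >= 0
  A <- cbind(rep(1, ncol(x)), diag(ncol(x)))
  bvec <- c(1, rep(0, ncol(x)))
  solve.QP(Dmat = D, dvec = d, Amat = A, bvec = bvec, meq = 1)$solution
}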

It's based on this (which is GPL-3 licensed).
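
For reference, the quadratic program that the snippet above appears to set up can be written out as follows, with X the matrix of model predictions (p columns), y the observed values and w the weight vector. This is a reading of the code, not a formulation taken from the package.

```latex
\begin{aligned}
\|y - Xw\|^2 &= w^\top X^\top X\, w - 2\, y^\top X\, w + y^\top y \\[4pt]
\min_{w \in \mathbb{R}^p} \quad & \tfrac{1}{2}\, w^\top D\, w - d^\top w,
  \qquad D = X^\top X,\quad d = X^\top y \\
\text{subject to} \quad & \mathbf{1}^\top w = 1, \qquad w_j \ge 0 \ \text{ for } j = 1, \dots, p
\end{aligned}
```

Dropping the constant term and halving the objective does not change the minimizer, so solving this program yields the least-squares weights restricted to convex combinations.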

update.caretStack that calls update.Train on the ensembling model

Lets you change the stacking model's parameters by hand, if you so desire.

Choose S3 or S4 methods for caretEnsemble

Just a note to choose between PR #42 for S3 methods and PR #41 for S4 methods for the major caretStack and caretEnsemble methods in the package.

@zachmayer Just pull in 1 of the pull requests and you can reject the other as well as PR #39, which is a subset of the other two.

Allow repeat model methods

Say we want 2 random forests in the ensemble, one with mtry=2 and one with mtry=5. Currently we can't have duplicate methods.
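
One simple way to support repeated methods is to give every model a unique name, for example with base R's make.unique. The spec list below is a hypothetical layout used only to illustrate the naming.

```r
# Hypothetical sketch: give repeated methods distinct names.
methods <- c("rf", "rf", "glmnet")
model_names <- make.unique(methods, sep = "_")

# Each name maps to its own specification, so the two random forests
# can carry different tuning values.
specs <- setNames(
  list(
    list(method = "rf", mtry = 2),
    list(method = "rf", mtry = 5),
    list(method = "glmnet")
  ),
  model_names
)
names(specs)
```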

Roadmap?

@zachmayer this package is working in production for me in a couple of places. I was thinking it would be nice to finish up some of the current features and build unit tests to make sure the core feature -- ensembling of classification and regression caret objects, stays stable and works. I'm a little nervous that some of the additions I have been making over in Dev cloud up the focus of the package more than they help.

With that in mind I was thinking we could use this issue to brainstorm ideas for package extensions and prioritize them a bit based on needs. From my end I would say the top 4 things to improve are:

A unit test suite for all objects and their generics/methods
Unit tests for the optimizers, with some edge cases
Make sure the predict method works in edge cases and arrive at some way to calculate prediction errors
Convenient print, plot, and summary methods for caretEnsemble and caretList objects

Then some "nice to have" features would be:

Better variable importance calculations
Speed up prediction performance (parallelize it somehow)

What do you think?

Classification using caretEnsemble

I am using the mlbench data example for classification given on the site. While building the models I get the following error for all the models:
Error in train.default(X[train, ], Y[train], method = "mlpWeightDecay", :
final tuning parameters could not be determined
This message is for mlpWeightDecay, but the error "final tuning parameters could not be determined" occurs for all the algorithms. Can somebody help me with this?

checkModels complains about multiclass problems using binary classification models

The code below constructs a number of (binary) classification models. Upon building an ensemble, an error is thrown about multiclass problems.

myControl=trainControl(
classProbs = TRUE,
summaryFunction=twoClassSummary,
method = "repeatedcv",
repeats = 10,
number=10
)

glmtrain = train(x,y, trControl = myControl, method='glm', metric='ROC')
gbmtrain = train(x,y, trControl = myControl, method='gbm', metric='ROC')
rftrain = train(x,y, trControl = myControl, method='rf', metric='ROC')
nbtrain = train(x,y, trControl = myControl, method='nb', metric='ROC')

with str(y) = Factor w/ 2 levels "YES","NO": 2 2 1 2 2 2 2 2 1 1 ...

nestedList <- list(glmtrain, gbmtrain, rftrain, nbtrain)

trainedEnsemble = caretEnsemble(nestedList, iter=1000)
Error in checkModels_extractTypes(list_of_models) :
Not yet implemented for multiclass problems

buildModels fails because of an error in seed dimensions from setSeeds

For some model methods the approximation used by setSeeds fails to produce a seed object of the correct dimensions. Here is a MWE:

set.seed(442)
library(caret)
library(hda)
library(caretEnsemble)
train <- twoClassSim(n = 1000, intercept = -8, linearVars = 3, 
                     noiseVars = 10, corrVars = 4, corrValue = 0.6)
test <- twoClassSim(n = 1500, intercept = -7, linearVars = 3, 
                    noiseVars = 10, corrVars = 4, corrValue = 0.6)

fitControl <- trainControl(method='cv', number = 5, savePredictions = FALSE, 
                           classProbs=TRUE, summaryFunction = twoClassSummary)

out <- buildModels(methodList = c("hda", "multinom"), control = fitControl, 
                   x = train[, -23], 
                   y = train[ , "Class"], metric = "ROC",
                   tuneLength = 4, baseSeed = 1204)

Which produces this helpful output:

Error in train.default(x = x, y = y, method = i, trControl = ctrl, 
tuneLength = tl,  : 
  Bad seeds: the seed object should be a list of length 6 with 5 integer 
vectors of size 48 and the last list element having a single integer

Which is absolutely correct. hda seems to expect seeds along its grid to follow the formula tuneLength * 3 * tuneLength, while usually the tuneLength^2 approximation is what works.

I should have this patched up rather quickly for hda, but I'm wondering if there is a more programmatic way to set seeds based on tuneLength using some caret functions?
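
For illustration, the shape described in the error message (one integer vector per resample, plus a single integer for the final model) can be built in base R as below. `make_seeds` is a hypothetical helper; it does not answer the question of how to derive the number of tuning candidates for a given method, which is taken here as an input.

```r
# Hypothetical helper: build a seeds list in the shape the error message describes.
# n_resamples:  number of resampling iterations (5 for 5-fold CV)
# n_candidates: number of tuning-parameter combinations (48 in the error above)
make_seeds <- function(n_resamples, n_candidates, base_seed = 1204L) {
  set.seed(base_seed)  # note: this resets the global RNG state
  seeds <- vector("list", n_resamples + 1L)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(1000000L, n_candidates)
  }
  # The last element holds a single seed for the final model fit
  seeds[[n_resamples + 1L]] <- sample.int(1000000L, 1L)
  seeds
}

seeds <- make_seeds(n_resamples = 5L, n_candidates = 48L)
lengths(seeds)
```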

Suppress unnecessary messages in predict.caretEnsemble

If the new dataset has no incomplete cases, I'd like to suppress the message("Predictions being made only for cases with complete data").

Similarly, if all models have available data, I'd like to suppress: message("Predictions being made only from models with available data")

I think it's a little confusing to print those messages in cases when they don't apply. I think we could fix this by adding a newdata=NULL argument and passing it to predict.train.

Auto import for the predict method

If you ensemble a series of models using external packages (e.g. mda or glmnet) and then clear your workspace and R session and load the caretEnsemble object and attempt to use the predict method, it fails.

This is because the predict method for each individual model type is not necessarily in the namespace. The predict.caretEnsemble should be able to identify and import the predict methods necessary to generate predictions for each model type to be ensembled.
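
A rough sketch of the idea, assuming each model records the packages it needs in a modelInfo$library component (an assumption about the object layout). The helper names and the mock models are hypothetical.

```r
# Hypothetical sketch: collect the packages each model needs and check they load.
required_packages <- function(models) {
  unique(unlist(lapply(models, function(m) m$modelInfo$library)))
}

load_required_packages <- function(models) {
  pkgs <- required_packages(models)
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  not_installed <- pkgs[!available]
  if (length(not_installed) > 0) {
    stop(
      "Install these packages before predicting: ",
      paste(not_installed, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(pkgs)
}

# Mock models that only need packages shipped with R
mock_models <- list(
  a = list(modelInfo = list(library = "stats")),
  b = list(modelInfo = list(library = c("stats", "utils")))
)
load_required_packages(mock_models)
```

Calling such a helper at the top of the predict method would turn an obscure failure into a clear message about which packages are missing.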

Error: length(unique(indexes)) == 1 is not TRUE for caret large ensemble of models

First of all, let me thank you for the much-needed ensemble package which you wrote.

I have run into an issue when attempting to train a large number of regression models. After pre-selecting the models that work, I get 60 to 70 functional models, all listed in caret and all working individually. With these, I attempt to run

greedy <- caretEnsemble(all.models, iter=1000L)

I come out with the following message.

Error: length(unique(indexes)) == 1 is not TRUE

where from your code I go to the origin of the issue which is

makePredObsMatrix(all.models) 
Error: length(unique(indexes)) == 1 is not TRUE
length(unique(indexes)) = 2

In one example, the lengths in unique(indexes) are 212 in [[1]] and 218 in [[2]], while the length of the observations must correspond to one of these, as per the description.

Do you know if there is a way to correct this error from within, so as to include all models? This seems to be specific to a model or group of models (maybe related to caret 6.0), as far as I can see.

Any help will be welcomed. Thank you

prune.caretEnsemble: remove models with weight 0 from the library

This can also reduce the object size. The user could call this after fitting the ensemble to reduce its size without changing its output.
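
The claim that pruning leaves the output unchanged can be checked with a small base-R sketch. Here an "ensemble" is just a list of lm fits plus named weights; `prune_ensemble` and `predict_ensemble` are hypothetical helpers, not the package's API.

```r
# Hypothetical sketch: an "ensemble" is a list of models plus named weights.
prune_ensemble <- function(ensemble) {
  keep <- ensemble$weights > 0
  list(models = ensemble$models[keep], weights = ensemble$weights[keep])
}

predict_ensemble <- function(ensemble, newdata) {
  preds <- vapply(
    ensemble$models,
    function(m) as.numeric(predict(m, newdata = newdata)),
    numeric(nrow(newdata))
  )
  as.numeric(preds %*% ensemble$weights)  # weighted sum of model predictions
}

ens <- list(
  models = list(
    a = lm(mpg ~ wt, data = mtcars),
    b = lm(mpg ~ hp, data = mtcars),
    c = lm(mpg ~ qsec, data = mtcars)
  ),
  weights = c(a = 0.6, b = 0.4, c = 0)
)
pruned <- prune_ensemble(ens)
same_output <- all.equal(predict_ensemble(ens, mtcars), predict_ensemble(pruned, mtcars))
```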

Add verbose option to prediction functions to suppress progress bars

Setup Travis CI

The new version of devtools looks pretty slick. It has functions for setting up testthat, putting data in the correct folders, and even better, using travis CI to automate testthat tests.

I'd really love to set up this project to use travis for automated testing!

http://blog.rstudio.org/2014/10/02/devtools-1-6/
https://travis-ci.org

warning in buildModels if trainControl indexes are not set

If buildModels is not passed a trainControl object with pre-defined indexes for use with all models, we should raise a warning and try to set the resampling indexes ourselves, based on trainControl$method.

If we can't set the indexes, we should raise an error.

This probably requires a new function setIndexes.

Classification using caretEnsemble

This is the error while using glmnet
"Error in train.default(X[train, ], Y[train], method = "gbm", trControl = myControl, :
final tuning parameters could not be determined"

Automate unit tests

This will help us detect problems as they arise (e.g. next time there's another major update to caret).
