mlr-org / mlr
Machine Learning in R
Home Page: https://mlr.mlr-org.com
License: Other
I think there were some minor issues with lower bounds or other stuff
Click on the links on any page of the tutorial
Options:
NULL
e.g. numeric(0)
NA
Using the survival::Surv function on the LHS of the formula is the preferred way to construct formulas required by most survival packages, as this does not force copies of the input data.
But the argument delete.env is a hindrance here: with no environment attached, the survival package is not on the search path and the function lookup will fail. On the other hand, I'd like to not carry these environments around, for obvious reasons.
Is it okay to touch the interface of this function? The parameter delete.env is never used in mlr, but might be used in other projects.
I'd opt to replace it with a new parameter env defaulting to NULL or emptyenv(). I could then set this to as.environment("package:survival"), which should have a similar effect but will allow the function lookup.
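Very roughly, the change could look like the following sketch (the function name makeSurvFormula, its arguments and the env default are hypothetical, only meant to illustrate the idea):
makeSurvFormula = function(target, features, env = NULL) {
  f = as.formula(paste0("Surv(", paste(target, collapse = ", "), ") ~ ",
    paste(features, collapse = " + ")))
  # instead of deleting the environment entirely, attach the one the caller
  # asks for, e.g. as.environment("package:survival"), so Surv() can be found
  environment(f) = if (is.null(env)) emptyenv() else env
  f
}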
Output as list or a generic for conversion would be nice.
This code snippet is called on each class label separately:
instantiateResampleInstance.CVDesc = function(desc, size) {
  test.inds = sample(size)
  # don't warn when we can't split evenly
  test.inds = suppressWarnings(split(test.inds, seq_len(desc$iters)))
  makeResampleInstanceInternal(desc, size, test.inds = test.inds)
}
Remaining observations are distributed to the first folds. After joining the separate splits, you can end up with up to [iters] more observations in the first fold than in the others.
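A small illustration of that behaviour (not mlr code, just base R): splitting 7 observations of one class into 3 folds puts the remainder into the first folds.
set.seed(1)
inds = sample(7)
lengths(suppressWarnings(split(inds, seq_len(3))))  # fold sizes: 3, 2, 2
Called once per class label, these per-class remainders all land in the first folds, which is where the imbalance after joining comes from.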
How can I implement a version of nested resampling?
So far, I'm splitting my data into training and test sets (using method = "subsample").
Now, I want to run a feature selection on the training sets, using cross-validation. Afterwards, I want to evaluate my results on the test sets of the subsamples.
Unfortunately, I can't find anything similar in the tutorial.
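A rough sketch of one way this could be set up (treat it as an illustration, not a definitive answer; the control settings are arbitrary): the feature selection runs on an inner CV inside each outer subsampling split.
library(mlr)
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
inner = makeResampleDesc("CV", iters = 3)
wrapped = makeFeatSelWrapper(lrn, resampling = inner,
  control = makeFeatSelControlRandom(maxit = 10))
outer = makeResampleDesc("Subsample", iters = 5)
# the outer loop estimates performance of the whole "select features, then fit" procedure
r = resample(wrapped, task, resampling = outer, show.info = FALSE)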
https://github.com/berndbischl/mlr/blob/master/R/SupervisedTask_operators.R#L140
I guess this is unwanted?
Maybe show an example of a grid search and convert the opt.path to a data.frame.
Maybe for normal tuning and wrappers.
Users simply need to understand how to get all evaluated points.
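A possible shape for such a tutorial example (a sketch; the parameter grid is arbitrary):
library(mlr)
task = makeClassifTask(data = iris, target = "Species")
ps = makeParamSet(makeDiscreteParam("cp", values = c(0.01, 0.05, 0.1)))
rdesc = makeResampleDesc("CV", iters = 3)
res = tuneParams("classif.rpart", task = task, resampling = rdesc,
  par.set = ps, control = makeTuneControlGrid(), show.info = FALSE)
# all evaluated points, one row per grid point
as.data.frame(res$opt.path)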
Checked by Bernd:
fu** R.
In some cases, e.g., calling resample in later tutorial sections, show.info should be set to FALSE, so we do not get so much crap on the page.
Only when the output is very long and the reader does not really gain any additional understanding from seeing it.
Also in the tutorial.
What if the user wants to set a certain constant threshold value for a learner?
Wasn't there an option for that? Check again.
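If I remember correctly, setThreshold on a probability prediction is the existing option; a minimal sketch to check against (the threshold value is arbitrary):
library(mlr)
library(mlbench)
data(Sonar)
task = makeClassifTask(data = Sonar, target = "Class", positive = "M")
lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, task)
pred = predict(mod, task = task)
# re-threshold: predict the positive class "M" only if its probability exceeds 0.7
pred2 = setThreshold(pred, 0.7)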
It seems helpful to either have a method to extract the probability matrix from a prediction object or to store it directly as a matrix / data.frame.
Example:
library(mlr)
learner <- makeLearner("classif.lda", predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)
as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])
This does not seem like an elegant solution, neither does anything I can come up with at the moment.
I think pred.obj$pred should return a matrix, but I don't know how that would interfere with existing methods.
See whether we can create an AUC measure for more than 2 classes in mlr.
Here is a hint, sent by Markus by mail.
library(mlr)
library(HandTill2001)
learner <- makeLearner("classif.lda", predict.type = "prob")
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(learner = learner, task = task)
pred.obj <- predict(mod, newdata = iris)
predicted <- as.matrix(pred.obj$data[, paste("prob.", levels(pred.obj$data$response), sep = "")])
colnames(predicted) <- levels(pred.obj$data$response)
auc(multcap(response = pred.obj$data$response, predicted = predicted))
Investigate
Add the measure, document it, and test it.
Briefly describe it in the tutorial / ROC part.
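A rough sketch of how the measure could be defined via makeMeasure, reusing the HandTill2001 computation from above (the id, properties and best/worst values are assumptions, not a final implementation):
library(mlr)
library(HandTill2001)
multiclass.auc = makeMeasure(id = "multiclass.auc", minimize = FALSE,
  best = 1, worst = 0.5,
  properties = c("classif", "classif.multi", "req.pred", "req.truth", "req.prob"),
  fun = function(task, model, pred, feats, extra.args) {
    probs = as.matrix(pred$data[, paste("prob.", levels(pred$data$truth), sep = "")])
    colnames(probs) = levels(pred$data$truth)
    auc(multcap(response = pred$data$truth, predicted = probs))
  })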
Basically, one wants to see which feature gets added or removed and how that changes performance.
The code to get started is here:
https://github.com/berndbischl/mlr/blob/master/R/analyzeFeatSelResult.R
Only the first 2 functions are relevant; the rest should be checked and possibly removed if not useful.
Somehow the make tutorial script does not work correctly. Check the performance page.
Using knitr manually from RStudio works fine.
Code is in todo-files/benchmark
Similar to this
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred) max(perf.test) - min(perf.test))
Possibly export makeAggregation so the user can do this, too.
Also explain how to do this in the tutorial.
result = resample(learner = lrn, task = tsk, resampling = rsmpl, show.info = FALSE)
still prints: Loading packages on slaves: mlr
Or look over here to see what happens.
Check that filtering is nicely explained in the tutorial
Can we access the filtered features after training a filter wrapper?
add MRMR, maybe also fmrmr
Methods such as plsDA and geoDA. Perhaps take a look at whether it is an interesting package in general. The last update was in November 2013.
Also linDA and quDA are available, but I don't know the difference compared to MASS's lda or qda, which already exist in mlr.
For instance
library("mlr")
data(iris)
tsk = makeClassifTask(data=iris, target="Species")
lrn = makeLearner("classif.fnn")
bagLrn = makeBaggingWrapper(lrn, bag.iters=5, bag.replace=TRUE, bag.size=0.6, bag.feats=3/4, predict.type="prob")
rsmpl = makeResampleDesc("RepCV", reps=5, folds=2)
resample(learner=bagLrn, task=tsk, resampling=rsmpl)
[Resample] repeated cross-validation iter: 1
Error in (function (train, test, cl, k = 1, prob = FALSE, algorithm = c("kd_tree", :
dims of 'test' and 'train' differ
I think the predictor dislikes the fact that it gets the full dataset, including variables not used during learning (bag.feats=3/4).
Not that tragic. But see here
r = resample(lrn, task, resampling = rout, extract = getTuneResult, show.info = FALSE)
Will generate a lot of output.
For imbalanced classes. What are good and simple strategies here?
Some methods can be used for regression, some for classification.
Some work with categorical, some with numerical, some with mixed feature sets.
Check this in the code and document it on the help page.
Whether they look right
Can sometimes be annoying. Add option to configureMlr?
These are the files:
Also, ImputeWrapper is probably a better and shorter name than PreprocImputeWrapper.
We already have a couple of those:
We have to make a list, then make the interface the same, so something like:
doTheOp(obj, data, target) : generic
doTheOp.data.frame
doTheOp.task
makeOpWrapper: internally calls doTheOp
getOpResults(model): allows the user to access the operation results when the wrapper was trained
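A minimal sketch of that pattern (the names are taken from the list above, the bodies are placeholders; dispatching on the data argument is an assumption):
doTheOp = function(obj, data, target) UseMethod("doTheOp", data)

doTheOp.data.frame = function(obj, data, target) {
  # do the actual operation on the raw data and return both the transformed
  # data and the operation results, so getOpResults() can expose them later
  list(data = data, result = NULL)
}

doTheOp.Task = function(obj, data, target) {
  # unwrap the task and delegate to the data.frame method
  doTheOp(obj, getTaskData(data), getTaskTargetNames(data))
}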
Construct an example: a 2-class problem from mlbench, 2 learners.
Cross-validate and compare ROC curves in one plot.
Add this example to the tutorial part and to the @example in asROCRPredictions.R.
ROCR has examples showing how the plot is constructed; copy a simple one after calling asROCRPredictions.
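One possible shape for such an example, using ROCR directly to pool the cross-validated predictions into one curve per learner (a sketch; the tutorial version would go through asROCRPredictions as described above):
library(mlr)
library(mlbench)
library(ROCR)
data(Sonar)
task = makeClassifTask(data = Sonar, target = "Class", positive = "M")
lrns = list(makeLearner("classif.lda", predict.type = "prob"),
  makeLearner("classif.rpart", predict.type = "prob"))
rdesc = makeResampleDesc("CV", iters = 5)
cols = c("blue", "red")
for (i in seq_along(lrns)) {
  r = resample(lrns[[i]], task, rdesc, show.info = FALSE)
  # pooled predictions over all folds, scored by the probability of the positive class "M"
  rocr.pred = prediction(r$pred$data$prob.M, r$pred$data$truth, label.ordering = c("R", "M"))
  plot(performance(rocr.pred, "tpr", "fpr"), col = cols[i], add = (i > 1))
}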
It is annoying that if a learner from caret is loaded, caret's train method shadows mlr's method. This only happens in the global user namespace, but it leads to an unintuitive error message for users.
I currently see no real fix except renaming, which I dislike.
Let's think about it.
In all GitHub wiki / readme / tutorial files.
Here is an example of how to simultaneously look at mmce and the range of errors over resampling.
library(mlr)
library(mlbench)
library(ggplot2)
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 2)
ms1 = mmce
my.range.aggr = mlr:::makeAggregation(id = "test.range",
  fun = function(task, perf.test, perf.train, measure, group, pred) max(perf.test) - min(perf.test))
ms2 = setAggregation(mmce, my.range.aggr)
res = selectFeatures(lrn, task, rdesc, measures = list(ms1, ms2), control = makeFeatSelControlExhaustive())
perf.data = as.data.frame(res$opt.path)
p = ggplot(data = perf.data, aes(x = mmce.test.mean, y = mmce.test.range)) +
  geom_point()
print(p)
Add a page for such stuff in the wiki.
mlr already supports weighted observations. Learners have a property that tells you whether they can be fitted in a weighted way. listLearners can give you all such learners.
Better describe in the tutorial how this works, probably in the "learner" part.
train and resample allow passing weights.
tuneParams, selectFeatures and the corresponding wrappers do not.
Discuss and then extend. Maybe one also wants to set the weights in the task? Less annoying in some cases.
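For reference, roughly how the current state looks (a sketch; the weights here are random and only for illustration):
library(mlr)
task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.rpart")
w = runif(nrow(iris))
# works: weights in train and resample
mod = train(lrn, task, weights = w)
r = resample(lrn, task, makeResampleDesc("CV", iters = 3), weights = w, show.info = FALSE)
# list all classification learners that support observation weights
listLearners("classif", properties = "weights")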
It might not really be needed, but it seems like it had been there, and at least analyzeFeatSelResult depended on it.
names(getOptPathEl(opt.path, i))
NULL