mlr-org / mlr3
mlr3: Machine Learning in R - next generation
Home Page: https://mlr3.mlr-org.com
License: GNU Lesser General Public License v3.0
It would be nice if the pbdR packages were utilized in mlr3 for efficient scalability.
Is this a good idea? Maybe it is.
This matches better with mlrPipelines.
Hi there. I would like to start a discussion about the mlr3 scope - I hope it can be useful for the community.
First of all, I'm very glad to see that mlr converged to using R6. IMHO there is no need to re-invent the wheel; in the R community we just need to leverage the design of the super-successful scikit-learn. For this reason I've created the mlapi pkg, which almost mimics the scikit-learn API. I use it in https://github.com/dselivanov/text2vec and https://github.com/dselivanov/rsparse.
I believe that such a core pkg as mlr3 should:
provide only the interface that other pkgs should follow. We have a zoo of pkgs implementing ML algorithms; quality and interfaces vary a lot. I think it is obvious now that the approach taken by caret and mlr was not entirely correct: we can't wrap every useful pkg and re-create its API/interface.
So I believe essentially, if a pkg aims to be a standard pkg for ML in R, there are 2 choices:
Use the correct level of abstraction. Here I strongly believe we have to stick to matrices (dense and sparse from the Matrix pkg), as is done in scikit-learn. On top of that we may implement "transformers" to construct design matrices from data.frames.
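The "transformers" idea could be sketched along these lines, using Matrix::sparse.model.matrix to expand a data.frame into a sparse design matrix. This is a minimal illustration only; the function name is an assumption, not part of any proposed API.

```r
# Minimal sketch of a data.frame -> sparse design matrix "transformer".
# Assumes the Matrix package; make_design_matrix is a hypothetical name.
library(Matrix)

make_design_matrix = function(df) {
  # model.matrix-style expansion: factors become dummy columns,
  # numerics pass through; the result is a sparse dgCMatrix
  sparse.model.matrix(~ . - 1, data = df)
}

X = make_design_matrix(data.frame(
  x = c(1.5, 2.0, 3.5),
  g = factor(c("a", "b", "a"))
))
dim(X)  # 3 rows, 3 columns: x, ga, gb
```

The factor `g` is one-hot encoded, so downstream learners only ever see a numeric (sparse) matrix, which is exactly the scikit-learn contract described above.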
task$data() currently only returns data.tables. In cases where the backend is a sparse matrix, we would like to supply a format = "sparse" arg, so the data is returned in sparse-matrix format.
library(mlr3)
requireNamespace("Matrix")
data = Matrix::Matrix(sample(0:1, 30, replace = TRUE), ncol = 3, sparse = TRUE)
colnames(data) = c("x1", "x2", "target")
rownames(data) = paste0("row_", 1:10)
b = as_data_backend(data)
task = TaskRegr$new(id = "spmat", b, target = "target")
Instead of:
d = task$data()
task$backend$data(paste0("row_", seq_len(nrow(d))), colnames(d), format = "sparse")
I would like to do:
task$data(format = "sparse")
I can try to create a PR, if required.
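The requested dispatch on a `format` argument could be sketched with plain functions and a toy backend; all names here are illustrative assumptions, not the real mlr3 classes.

```r
# Toy sketch of a backend that dispatches on a requested return format.
# backend_data is hypothetical; as.data.frame stands in for the real
# data.table conversion to keep the example dependency-free.
library(Matrix)

backend_data = function(mat, rows, cols, format = c("data.table", "sparse")) {
  format = match.arg(format)
  sub = mat[rows, cols, drop = FALSE]
  if (format == "sparse")
    return(sub)                    # hand back the sparse submatrix as-is
  as.data.frame(as.matrix(sub))    # densify for the tabular return path
}

m = Matrix(sample(0:1, 30, replace = TRUE), ncol = 3, sparse = TRUE,
           dimnames = list(paste0("row_", 1:10), c("x1", "x2", "target")))
s = backend_data(m, paste0("row_", 1:3), c("x1", "x2"), format = "sparse")
inherits(s, "sparseMatrix")  # TRUE
```

A `task$data(format = ...)` method would then simply forward the argument to the backend, as in the long-hand call shown above.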
maybe this is best
class_names / class_positive / class_n: would it make sense to have these consistent with checkmate?
Clutters the Appveyor builds: https://ci.appveyor.com/project/mllg/mlr3/build/1.0.83#L470
In Dictionary, we use key, and everywhere else we use id. It should probably always be id.
Storing measures as part of the task looked like a good idea at first, but it complicates things for benchmarks where there are different tasks and different measures. This could be caught at the start of benchmark(), but we generally want to be able to fuse rather arbitrary experiments into a BenchmarkResult. Possible next steps:
I currently tend to prefer (2). With (1), there is no natural location to store the measure object, except altering the task, which is awkward.
We probably need them in all other packages.
When calling rr$experiment() or rr$experiments() without an argument, we should either have a default (e.g. iter = 1) together with a message, or print a custom error message.
library(mlr3)
task = mlr_tasks$get("iris")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("cv")
resampling$param_vals = list(folds = 3)
rr = resample(task, learner, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 1/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 2/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 3/3)' ...
rr$experiment()
#> Error in assert_int(iter, lower = 1L, upper = nrow(self$data), coerce = TRUE): argument "iter" is missing, with no default
Created on 2018-12-18 by the reprex package (v0.2.1)
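The proposed default-with-message behavior could look roughly like this. The function shape is purely illustrative, not the actual ResampleResult method; a plain list stands in for the stored experiments.

```r
# Sketch of defaulting to iteration 1 with an informative message when
# no iteration is supplied. get_experiment is a hypothetical name.
get_experiment = function(experiments, iter = NULL) {
  if (is.null(iter)) {
    iter = 1L
    message("No iteration provided, returning experiment for iteration 1")
  }
  stopifnot(iter >= 1L, iter <= length(experiments))
  experiments[[iter]]
}

res = get_experiment(list("exp_fold_1", "exp_fold_2", "exp_fold_3"))
```

The alternative discussed above, a custom error, would just replace the `message()` branch with a `stop()` carrying a helpful hint.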
Currently, the predict type is a single string, i.e. "response" or "prob". We need to encode that "prob" automatically includes "response".
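One way to encode the inclusion is an ordered hierarchy of predict types plus a helper that expands a requested type into everything it implies. This is a sketch of the idea only, not the mlr3 implementation.

```r
# Sketch: "prob" implies "response" via an ordered hierarchy.
# Names are illustrative assumptions.
predict_type_order = c("response", "prob")

implied_predict_types = function(type) {
  pos = match(type, predict_type_order)
  stopifnot(!is.na(pos))
  predict_type_order[seq_len(pos)]
}

implied_predict_types("prob")      # "response" "prob"
implied_predict_types("response")  # "response"
```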
Task creation with a dataset containing an ordered factor variable fails for TaskClassif and TaskRegr.
df = data.frame(x = c(1, 2, 3), y = factor(c(1, 2, 3), ordered = TRUE), z = c("M", "R", "R"))
b = as_data_backend(df)
TaskClassif$new(id = "id", backend = b, target = "z")
throws:
Error in vapply(.x, .f, FUN.VALUE = .value, USE.NAMES = FALSE, ...) :
values must be length 1,
but FUN(X[[2]]) result is length 2
It works fine without the ordered factor variable.
# get some example tasks
tasks = mlr_tasks$mget(c("pima", "sonar", "spam"))
# get a featureless learner and a classification tree
learners = mlr_learners$mget(c("classif.featureless", "classif.rpart"))
# let the learners predict probabilities instead of class labels (required for AUC measure)
learners$classif.featureless$predict_type = "prob"
learners$classif.rpart$predict_type = "prob"
I would like to do learners$predict_type = "prob" instead of setting the predict_type (and other options) for each learner. This would probably require learners to be an R6 class with a method for handling certain slots of its children?
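Until such a collection class exists, a plain loop already does the job, because R6 objects are modified by reference. The sketch below uses environments as stand-ins for R6 learners so it runs self-contained; the same loop works on real Learner objects.

```r
# Toy stand-ins for R6 learners (environments share R6's reference
# semantics, so assignment inside the loop mutates the originals).
learners = list(
  classif.featureless = local({ e = new.env(); e$predict_type = "response"; e }),
  classif.rpart       = local({ e = new.env(); e$predict_type = "response"; e })
)

# set the slot on every learner in one pass
for (lrn in learners) lrn$predict_type = "prob"

learners$classif.rpart$predict_type  # "prob"
```

A dedicated collection class would essentially wrap this loop behind an active binding.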
Do we want classif.rpart or classif_rpart?
We really need this; pushing that issue back even further is going to hurt us.
How about futile.logger?
... and introduce a new column type 'weight'.
What I always disliked in mlr is that manually creating prediction objects was difficult.
Currently, it seems that we can only create prediction objects by passing a task.
Suppose I have a vector of predictions and a vector of true values. It would be great if I could use mlr3 to construct a prediction object just from this (+ possibly other information such as predict.type...).
Any thoughts?
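A task-free constructor could be as small as the sketch below: only the two vectors are required, everything else is optional. All names and fields here are hypothetical, not the mlr3 API.

```r
# Sketch of a prediction constructor that needs no task, only raw vectors.
# make_prediction and the "PredictionSketch" class are illustrative.
make_prediction = function(truth, response, predict_type = "response") {
  stopifnot(length(truth) == length(response))
  structure(
    list(truth = truth, response = response, predict_type = predict_type),
    class = "PredictionSketch"
  )
}

p = make_prediction(truth = c(1.2, 3.4), response = c(1.0, 3.0))
mean((p$truth - p$response)^2)  # a measure can score the object directly
```

Measures could then operate on any object exposing `$truth` and `$response`, regardless of whether it came from a resampling or was built by hand.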
$remove_rows()
$remove_cols()
$replace_backend()
this should probably be done in a vignette
Can we discuss whether we want to use the same ones as in mlr or a different structure?
We will probably soon have > 20 issues and should group them.
Maybe it's worth taking a look at how other projects organize their labels.
They should be somewhat generic so we can use them across all repos.
can get very lengthy
this came up when I resampled pipelines
maybe this goes away when we make the learner interface a bit more minimal?
or "col_info"
we also don't really need "col.types" anymore? That seems totally redundant then.
In my humble opinion the syntax is still a bit complicated, even aside from the custom R6 classes.
Compare the code:
data("mtcars", package = "datasets")
b = BackendDataTable$new(data = mtcars[, 1:3])
task = TaskRegr$new(id = "cars", b, target = "mpg")
with
data("mtcars", package = "datasets")
task = makeRegrTask(data = mtcars[, 1:3], target = "mpg")
The second one also seems more intuitive to me at the moment; how should I know that I need the "new" function to create a new task... ;)
Similar to how we do it in mlr.
What is actually the list of props we want? Is the data structure a character vector or a list (of keys and values)?
Currently we seem to say that we want to use the new "measures" package.
Do we really want that? During the last call we said we wanted to rather use Metrics:
https://cran.r-project.org/web/packages/Metrics/
Note that on CRAN there is also this
Reprex:
library(mlr3)
#> The mlr3 package is currently work-in-progress. Do not use in production. The API will change. You have been warned.
mlr_tasks
#> <DictionaryTask> with 6 stored values: bh, iris, pima, sonar,
#> spam, zoo
#>
#> Public: add, get, has, items, keys, mget, remove
# list keys
names(mlr_tasks)
#> [1] ".__enclos_env__" "keys" "items"
#> [4] "remove" "mget" "initialize"
#> [7] "has" "print" "add"
#> [10] "get"
# get a quick overview
as.data.frame(mlr_tasks)
#> Error in as.data.frame.default(mlr_tasks): cannot coerce class 'c("DictionaryTask", "Dictionary", "R6")' to a data.frame
Created on 2018-11-08 by the reprex package (v0.2.1)
Organization handling is different from Travis and a bit more cumbersome.
Currently all builds are running under Michel's and my personal accounts.
We should set up an organization account so that multiple people can manage ONE account: https://www.appveyor.com/docs/team-setup/#setting-up-appveyor-account-for-github-organization
when trying to train a pipeline.
Obviously I did something wrong, but I cannot see what happened and just got a "dummy.model".
This is bad.
Can it be deleted if this is the case?
It is currently possible to create regression tasks with a categorical target:
b = as_data_backend(iris)
TaskRegr$new("iris", backend = b, target = "Species")
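The fix presumably amounts to a type check at construction time. A sketch of the desired behavior, with an illustrative function name rather than the actual mlr3 internals:

```r
# Sketch of the assertion TaskRegr could run on construction:
# a regression target must be numeric. assert_regr_target is hypothetical.
assert_regr_target = function(data, target) {
  if (!is.numeric(data[[target]]))
    stop(sprintf("Target '%s' must be numeric for a regression task", target))
  invisible(TRUE)
}

assert_regr_target(iris, "Sepal.Length")  # passes silently
# assert_regr_target(iris, "Species")     # would throw an error
```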
We discussed whether a simple table is enough here.
Let's look at this a bit later.
I guess I still think a well-defined class is better; the object might be too important.
Unsure if we really need this, but we could also encapsulate the "score" via evaluate or callr. Then we also need to store the score_log.
Presumably easy to implement, but not very urgent.
This allows auto-incrementing. Rownames can still be saved as an extra column. Still unsure about this...
.onAttach? You should be able to set options via mlr_options([new_opts]). Additionally, with_mlr_options() would be nice to have.
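Both helpers could be thin wrappers around base R's options() machinery, with a package prefix for namespacing. mlr_options() and with_mlr_options() are proposals here, not existing functions.

```r
# Sketch of the proposed option helpers, namespaced under "mlr3.".
mlr_options = function(...) {
  new = list(...)
  if (length(new) == 0L) {
    opts = options()
    return(opts[startsWith(names(opts), "mlr3.")])  # getter: all mlr3 options
  }
  names(new) = paste0("mlr3.", names(new))
  options(new)                                      # setter
}

with_mlr_options = function(opts, code) {
  names(opts) = paste0("mlr3.", names(opts))
  old = options(opts)       # options() invisibly returns the previous values
  on.exit(options(old))     # restore on exit, even on error
  force(code)               # evaluate the expression with options in place
}

mlr_options(verbose = TRUE)
getOption("mlr3.verbose")                                            # TRUE
with_mlr_options(list(verbose = FALSE), getOption("mlr3.verbose"))   # FALSE
```

This follows the same save/restore pattern that the withr package uses, so temporary overrides cannot leak out of the `with_` scope.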
maybe a bit more concrete...
Also remove score()-type top-level functions and just call e$score(), as we want to control what / how much is logged in the result.
Tibble dependency has been removed (tidyverse/purrr#577).
I find this much clearer; "backend" is very generic.