mlr-org / mlr3
mlr3: Machine Learning in R - next generation
Home Page: https://mlr3.mlr-org.com
License: GNU Lesser General Public License v3.0
It would be nice if the pbdR packages were utilized in mlr3 for efficient scalability.
Is this a good idea? Maybe it is.
This matches better with mlrPipelines.
Hi there. I would like to start a discussion about the mlr3 scope - I hope it can be useful for the community.
First of all, I'm very glad to see that mlr converged to using R6. IMHO there is no need to re-invent the wheel; in the R community we just need to leverage the design of the super-successful scikit-learn. For this reason I've created the mlapi pkg, which almost mimics the scikit-learn API. I use it in https://github.com/dselivanov/text2vec and https://github.com/dselivanov/rsparse.
I believe that such a core pkg as mlr3 should:
provide only the interface that other pkgs should follow. We have a zoo of pkgs implementing ML algorithms; quality and interfaces vary a lot. I think it is obvious now that the approach taken by caret and mlr was not entirely correct: we can't wrap every useful pkg and re-create its API/interface.
So I believe essentially, if a pkg aims to be a standard pkg for ML in R, there are 2 choices:
Use the correct level of abstraction. Here I strongly believe we have to stick to matrices (dense and sparse from the Matrix pkg), as is done in scikit-learn. On top of that we may implement "transformers" to construct design matrices from data.frames.
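The "transformers" idea could be sketched along these lines, using Matrix::sparse.model.matrix to expand a data.frame into a sparse design matrix. This is a minimal illustration only; the function name is an assumption, not part of any proposed API.

```r
# Minimal sketch of a data.frame -> sparse design matrix "transformer".
# Assumes the Matrix package; make_design_matrix is a hypothetical name.
library(Matrix)

make_design_matrix = function(df) {
  # model.matrix-style expansion: factors become dummy columns,
  # numerics pass through; the result is a sparse dgCMatrix
  sparse.model.matrix(~ . - 1, data = df)
}

X = make_design_matrix(data.frame(
  x = c(1.5, 2.0, 3.5),
  g = factor(c("a", "b", "a"))
))
dim(X)  # 3 rows, 3 columns: x, ga, gb
```

The factor `g` is one-hot encoded, so downstream learners only ever see a numeric (sparse) matrix, which is exactly the scikit-learn contract described above.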
task$data() currently only returns data.tables. In cases where the backend is a sparse matrix, we would like to supply a format = "sparse" arg, so the data is returned in sparse-matrix format.
library(mlr3)
requireNamespace("Matrix")
data = Matrix::Matrix(sample(0:1, 30, replace = TRUE), ncol = 3, sparse = TRUE)
colnames(data) = c("x1", "x2", "target")
rownames(data) = paste0("row_", 1:10)
b = as_data_backend(data)
task = TaskRegr$new(id = "spmat", b, target = "target")
Instead of:
d = task$data()
task$backend$data(paste0("row_", seq_len(nrow(d))), colnames(d), format = "sparse")
I would like to do:
task$data(format = "sparse")
I can try to create a PR, if required.
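The requested dispatch on a `format` argument could be sketched with plain functions and a toy backend; all names here are illustrative assumptions, not the real mlr3 classes.

```r
# Toy sketch of a backend that dispatches on a requested return format.
# backend_data is hypothetical; as.data.frame stands in for the real
# data.table conversion to keep the example dependency-free.
library(Matrix)

backend_data = function(mat, rows, cols, format = c("data.table", "sparse")) {
  format = match.arg(format)
  sub = mat[rows, cols, drop = FALSE]
  if (format == "sparse")
    return(sub)                    # hand back the sparse submatrix as-is
  as.data.frame(as.matrix(sub))    # densify for the tabular return path
}

m = Matrix(sample(0:1, 30, replace = TRUE), ncol = 3, sparse = TRUE,
           dimnames = list(paste0("row_", 1:10), c("x1", "x2", "target")))
s = backend_data(m, paste0("row_", 1:3), c("x1", "x2"), format = "sparse")
inherits(s, "sparseMatrix")  # TRUE
```

A `task$data(format = ...)` method would then simply forward the argument to the backend, as in the long-hand call shown above.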
maybe this is best
class_names / class_positive / class_n: would it make sense to have these consistent with checkmate?
Clutters the Appveyor builds: https://ci.appveyor.com/project/mllg/mlr3/build/1.0.83#L470
In Dictionary, we use key, and everywhere else we use id. It should probably always be id.
Storing measures as part of the task looked like a good idea at first, but it complicates things for benchmarks where there are different tasks and different measures. This could be caught at the start of benchmark(), but we generally want to be able to fuse rather arbitrary experiments into a BenchmarkResult. Possible next steps:
I currently tend to prefer (2). With (1), there is no natural location to store the measure object, except altering the task, which is awkward.
We probably need them in all other packages.
When calling rr$experiment() or rr$experiments() without an argument, we should either have a default (e.g. iter = 1) together with a message, or print a custom error message.
library(mlr3)
task = mlr_tasks$get("iris")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("cv")
resampling$param_vals = list(folds = 3)
rr = resample(task, learner, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 1/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 2/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 3/3)' ...
rr$experiment()
#> Error in assert_int(iter, lower = 1L, upper = nrow(self$data), coerce = TRUE): argument "iter" is missing, with no default
Created on 2018-12-18 by the reprex package (v0.2.1)
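The proposed default-with-message behavior could look roughly like this. The function shape is purely illustrative, not the actual ResampleResult method; a plain list stands in for the stored experiments.

```r
# Sketch of defaulting to iteration 1 with an informative message when
# no iteration is supplied. get_experiment is a hypothetical name.
get_experiment = function(experiments, iter = NULL) {
  if (is.null(iter)) {
    iter = 1L
    message("No iteration provided, returning experiment for iteration 1")
  }
  stopifnot(iter >= 1L, iter <= length(experiments))
  experiments[[iter]]
}

res = get_experiment(list("exp_fold_1", "exp_fold_2", "exp_fold_3"))
```

The alternative discussed above, a custom error, would just replace the `message()` branch with a `stop()` carrying a helpful hint.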
Currently, the predict type is a single string, i.e. "response" or "prob". We need to encode that "prob" automatically includes "response".
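One way to encode the inclusion is an ordered hierarchy of predict types plus a helper that expands a requested type into everything it implies. This is a sketch of the idea only, not the mlr3 implementation.

```r
# Sketch: "prob" implies "response" via an ordered hierarchy.
# Names are illustrative assumptions.
predict_type_order = c("response", "prob")

implied_predict_types = function(type) {
  pos = match(type, predict_type_order)
  stopifnot(!is.na(pos))
  predict_type_order[seq_len(pos)]
}

implied_predict_types("prob")      # "response" "prob"
implied_predict_types("response")  # "response"
```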
Task creation with a dataset containing an ordered factor variable fails for TaskClassif and TaskRegr.
df = data.frame(x = c(1, 2, 3), y = factor(c(1, 2, 3), ordered = TRUE), z = c("M", "R", "R"))
b = as_data_backend(df)
TaskClassif$new(id = "id", backend = b, target = "z")
throws:
Error in vapply(.x, .f, FUN.VALUE = .value, USE.NAMES = FALSE, ...) :
values must be length 1,
but FUN(X[[2]]) result is length 2
It works fine without the ordered factor variable.
# get some example tasks
tasks = mlr_tasks$mget(c("pima", "sonar", "spam"))
# get a featureless learner and a classification tree
learners = mlr_learners$mget(c("classif.featureless", "classif.rpart"))
# let the learners predict probabilities instead of class labels (required for AUC measure)
learners$classif.featureless$predict_type = "prob"
learners$classif.rpart$predict_type = "prob"
I would like to do learners$predict_type = "prob" instead of setting the predict_type (and other options) for each learner. This would probably require learners to be an R6 class with a method for handling certain slots of its children?
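Until such a collection class exists, a plain loop already does the job, because R6 objects are modified by reference. The sketch below uses environments as stand-ins for R6 learners so it runs self-contained; the same loop works on real Learner objects.

```r
# Toy stand-ins for R6 learners (environments share R6's reference
# semantics, so assignment inside the loop mutates the originals).
learners = list(
  classif.featureless = local({ e = new.env(); e$predict_type = "response"; e }),
  classif.rpart       = local({ e = new.env(); e$predict_type = "response"; e })
)

# set the slot on every learner in one pass
for (lrn in learners) lrn$predict_type = "prob"

learners$classif.rpart$predict_type  # "prob"
```

A dedicated collection class would essentially wrap this loop behind an active binding.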
Do we want classif.rpart or classif_rpart?
We really need this; pushing that issue back even further is going to hurt us.
How about futile.logger?
... and introduce a new column type 'weight'.
What I always disliked in mlr is that manually creating prediction objects was difficult.
Currently, it seems that we can only create prediction objects by passing a task.
Suppose I have a vector of predictions and a vector of true values. It would be great if I could use mlr3 to construct a prediction object just from this (+ possibly other information such as predict.type...).
Any thoughts?
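A task-free constructor could be as small as the sketch below: only the two vectors are required, everything else is optional. All names and fields here are hypothetical, not the mlr3 API.

```r
# Sketch of a prediction constructor that needs no task, only raw vectors.
# make_prediction and the "PredictionSketch" class are illustrative.
make_prediction = function(truth, response, predict_type = "response") {
  stopifnot(length(truth) == length(response))
  structure(
    list(truth = truth, response = response, predict_type = predict_type),
    class = "PredictionSketch"
  )
}

p = make_prediction(truth = c(1.2, 3.4), response = c(1.0, 3.0))
mean((p$truth - p$response)^2)  # a measure can score the object directly
```

Measures could then operate on any object exposing `$truth` and `$response`, regardless of whether it came from a resampling or was built by hand.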
$remove_rows()
$remove_cols()
$replace_backend()
this should probably be done in a vignette
Can we discuss whether we want to use the same ones as in mlr or a different structure?
We will probably soon have > 20 issues and should group them.
Maybe it's worth taking a look at how other projects organize their labels.
They should be somewhat generic so we can use them across all repos.
can get very lengthy
this came up when I resampled pipelines
maybe this goes away when we make the learner interface a bit more minimal?
or "col_info"
we also don't really need "col.types" anymore? That seems totally redundant then.
In my humble opinion the syntax is still a bit complicated, even aside from the custom R6 classes.
Compare the code:
data("mtcars", package = "datasets")
b = BackendDataTable$new(data = mtcars[, 1:3])
task = TaskRegr$new(id = "cars", b, target = "mpg")
with
data("mtcars", package = "datasets")
task = makeRegrTask(data = mtcars[, 1:3], target = "mpg")
The second one also seems more intuitive to me at the moment; how should I know that I need the "new" function to create a new task... ;)
Similar to how we do it in mlr.
What is actually the list of props we want? Is the data structure a character vector or a list (of keys and values)?
Currently we seem to say that we want to use the new "measures" package.
Do we really want that? During the last call we said we wanted to rather use Metrics:
https://cran.r-project.org/web/packages/Metrics/
Note that on CRAN there is also this
Reprex:
library(mlr3)
#> The mlr3 package is currently work-in-progress. Do not use in production. The API will change. You have been warned.
mlr_tasks
#> <DictionaryTask> with 6 stored values: bh, iris, pima, sonar,
#> spam, zoo
#>
#> Public: add, get, has, items, keys, mget, remove
# list keys
names(mlr_tasks)
#> [1] ".__enclos_env__" "keys" "items"
#> [4] "remove" "mget" "initialize"
#> [7] "has" "print" "add"
#> [10] "get"
# get a quick overview
as.data.frame(mlr_tasks)
#> Error in as.data.frame.default(mlr_tasks): cannot coerce class 'c("DictionaryTask", "Dictionary", "R6")' to a data.frame
Created on 2018-11-08 by the reprex package (v0.2.1)
Organization handling is different from Travis and a bit more cumbersome.
Currently all builds are running under Michel's and my personal accounts.
We should set up an organization account so that multiple people can manage ONE account: https://www.appveyor.com/docs/team-setup/#setting-up-appveyor-account-for-github-organization
when trying to train a pipeline.
Obviously I did something wrong, but I cannot see what happened and just got a "dummy.model".
This is bad.
Can it be deleted if this is the case?
It is currently possible to create regression tasks with a categorical target:
b = as_data_backend(iris)
TaskRegr$new("iris", backend = b, target = "Species")
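The fix presumably amounts to a type check at construction time. A sketch of the desired behavior, with an illustrative function name rather than the actual mlr3 internals:

```r
# Sketch of the assertion TaskRegr could run on construction:
# a regression target must be numeric. assert_regr_target is hypothetical.
assert_regr_target = function(data, target) {
  if (!is.numeric(data[[target]]))
    stop(sprintf("Target '%s' must be numeric for a regression task", target))
  invisible(TRUE)
}

assert_regr_target(iris, "Sepal.Length")  # passes silently
# assert_regr_target(iris, "Species")     # would throw an error
```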
We discussed whether a simple table is enough here.
Let's look at this a bit later.
I guess I still think a well-defined class is better; the object might be too important.
Unsure if we really need this, but we could also encapsulate the "score" via evaluate or callr. Then we also need to store the score_log.
Presumably easy to implement, but not very urgent.
This allows auto-incrementing. Rownames can still be saved as an extra column. Still unsure about this...
.onAttach? You should be able to set options via mlr_options([new_opts]). Additionally, with_mlr_options() would be nice to have.
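Both helpers could be thin wrappers around base R's options() machinery, with a package prefix for namespacing. mlr_options() and with_mlr_options() are proposals here, not existing functions.

```r
# Sketch of the proposed option helpers, namespaced under "mlr3.".
mlr_options = function(...) {
  new = list(...)
  if (length(new) == 0L) {
    opts = options()
    return(opts[startsWith(names(opts), "mlr3.")])  # getter: all mlr3 options
  }
  names(new) = paste0("mlr3.", names(new))
  options(new)                                      # setter
}

with_mlr_options = function(opts, code) {
  names(opts) = paste0("mlr3.", names(opts))
  old = options(opts)       # options() invisibly returns the previous values
  on.exit(options(old))     # restore on exit, even on error
  force(code)               # evaluate the expression with options in place
}

mlr_options(verbose = TRUE)
getOption("mlr3.verbose")                                            # TRUE
with_mlr_options(list(verbose = FALSE), getOption("mlr3.verbose"))   # FALSE
```

This follows the same save/restore pattern that the withr package uses, so temporary overrides cannot leak out of the `with_` scope.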
maybe a bit more concrete...
Also remove score()-type top-level functions and just call e$score(), as we want to control what / how much is logged in the result.
Tibble dependency has been removed (tidyverse/purrr#577).
I find this much clearer; "backend" is very generic.