mlr3: Machine Learning in R - next generation
Home Page: https://mlr3.mlr-org.com
License: GNU Lesser General Public License v3.0
... and introduce new column type 'weight'.
Clutters the Appveyor builds: https://ci.appveyor.com/project/mllg/mlr3/build/1.0.83#L470
What is actually the list of props we want? Is the data structure a character vector or a list (of keys and values)?
Storing measures as part of the task looked like a good idea at first, but it complicates things for benchmarks, where there are different tasks and different measures. This could be caught at the start of benchmark(), but we generally want to be able to fuse rather arbitrary experiments into a BenchmarkResult. Possible next steps:
I currently tend to prefer (2). With (1), there is no natural location to store the measure object, except altering the task, which is awkward.
do we want classif.rpart? classif_rpart?
We discussed whether a simple table is enough here.
Let's look at this a bit later.
I guess I still think a well-defined class is better; the object might be too important.
It is currently possible to create regression tasks with categorical target:
b = as_data_backend(iris)
TaskRegr$new("iris", backend = b, target = "Species")
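A guard in the constructor could catch this early. Below is a minimal base-R sketch of such a check; the helper name `assert_numeric_target` is made up for illustration:

```r
# Hypothetical guard for the TaskRegr constructor: reject targets that
# are not numeric (helper name is made up for illustration).
assert_numeric_target = function(data, target) {
  col = data[[target]]
  if (!is.numeric(col)) {
    stop(sprintf("Target '%s' must be numeric, but has class '%s'",
      target, class(col)[1L]))
  }
  invisible(TRUE)
}

assert_numeric_target(iris, "Sepal.Length")  # ok
# assert_numeric_target(iris, "Species")     # would throw an error
```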
Reprex:
library(mlr3)
#> The mlr3 package is currently work-in-progress. Do not use in production. The API will change. You have been warned.
mlr_tasks
#> <DictionaryTask> with 6 stored values: bh, iris, pima, sonar,
#> spam, zoo
#>
#> Public: add, get, has, items, keys, mget, remove
# list keys
names(mlr_tasks)
#> [1] ".__enclos_env__" "keys" "items"
#> [4] "remove" "mget" "initialize"
#> [7] "has" "print" "add"
#> [10] "get"
# get a quick overview
as.data.frame(mlr_tasks)
#> Error in as.data.frame.default(mlr_tasks): cannot coerce class 'c("DictionaryTask", "Dictionary", "R6")' to a data.frame
Created on 2018-11-08 by the reprex package (v0.2.1)
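One way to support the `as.data.frame()` call above is a small S3 method. A sketch, where `dict` is a plain-list stand-in rather than the real R6 Dictionary:

```r
# Sketch of an as.data.frame() S3 method for the dictionary; `dict` is a
# plain-list stand-in, not the real R6 Dictionary class.
as.data.frame.Dictionary = function(x, ...) {
  data.frame(key = x$keys(), stringsAsFactors = FALSE)
}

dict = structure(
  list(keys = function() c("bh", "iris", "pima", "sonar", "spam", "zoo")),
  class = "Dictionary"
)
as.data.frame(dict)
```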
# get some example tasks
tasks = mlr_tasks$mget(c("pima", "sonar", "spam"))
# get a featureless learner and a classification tree
learners = mlr_learners$mget(c("classif.featureless", "classif.rpart"))
# let the learners predict probabilities instead of class labels (required for AUC measure)
learners$classif.featureless$predict_type = "prob"
learners$classif.rpart$predict_type = "prob"
I would like to do learners$predict_type = "prob"
instead of setting the predict_type (and other options) for each learner.
This would probably require learners to be an R6 class with a method for handling certain slots of its children.
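A simpler alternative is a helper that loops over the list. A sketch using environments as stand-ins for R6 learner objects (which are environments under the hood); the helper name `set_all` is made up:

```r
# Sketch: set a field on every learner in a list at once; environments
# stand in for R6 learner objects here.
set_all = function(learners, field, value) {
  for (lrn in learners) assign(field, value, envir = lrn)
  invisible(learners)
}

learners = list(
  classif.featureless = new.env(),
  classif.rpart = new.env()
)
set_all(learners, "predict_type", "prob")
learners$classif.rpart$predict_type
#> [1] "prob"
```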
This came up when trying to train a pipeline.
Obviously I did something wrong, but I cannot see what happened and just got a "dummy.model".
This is bad.
I find this much clearer; "backend" is very generic.
as we want to control what / how much is logged in the result
maybe this is best
class_names
class_positive
class_n
maybe a bit more concrete...
Also remove score()-type top-level functions and just call e$score().
$remove_rows()
$remove_cols()
$replace_backend()
Similar to how we do it in mlr.
task$data() currently only returns data.tables.
In cases where the backend is a sparse matrix, we would like to supply a format = "sparse" argument, so the data is returned in sparse-matrix format.
requireNamespace("Matrix")
data = Matrix::Matrix(sample(0:1, 30, replace = TRUE), ncol = 3, sparse = TRUE)
colnames(data) = c("x1", "x2", "target")
rownames(data) = paste0("row_", 1:10)
b = as_data_backend(data)
task = TaskRegr$new(id = "spmat", b, target = "target")
Instead of:
d = task$data()
task$backend$data(paste0("row_", seq_len(nrow(d))), colnames(d), format = "sparse")
I would like to do:
task$data(format = "sparse")
I can try to create a PR, if required.
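The dispatch itself could be a simple `switch` on the `format` argument. A sketch, where the backend is a plain-list stand-in rather than the real DataBackend class:

```r
# Sketch of a data() accessor dispatching on a `format` argument; the
# backend here is a plain-list stand-in, not the real DataBackend.
task_data = function(backend, format = "data.table") {
  switch(format,
    "data.table" = backend$data_table(),
    "sparse"     = backend$sparse(),
    stop("unsupported format: ", format)
  )
}

backend = list(
  data_table = function() data.frame(x1 = c(0, 1), x2 = c(1, 0)),
  sparse     = function() Matrix::Matrix(c(0, 1, 1, 0), ncol = 2, sparse = TRUE)
)
str(task_data(backend, format = "data.table"))
```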
This allows auto-incrementation. Rownames can still be saved as extra column. Still unsure about this...
Organization handling is different from Travis and a bit more cumbersome.
Currently all builds are running under Michel's and my personal accounts.
We should set up an organization account so that multiple people can manage ONE account: https://www.appveyor.com/docs/team-setup/#setting-up-appveyor-account-for-github-organization
We probably need them in all other packages.
Can it be deleted if this is the case?
You should be able to set options via mlr_options([new_opts]). Additionally, with_mlr_options() would be nice to have.
In Dictionary, we use key, and everywhere else we use id. Should probably always be id.
Hi there. I would like to start a discussion about the scope of mlr3 - I hope it can be useful for the community.
First of all, I'm very glad to see that mlr converged on using R6. IMHO there is no need to reinvent the wheel; in the R community we just need to leverage the design of the super-successful scikit-learn. For this reason I've created the mlapi pkg, which almost mimics the scikit-learn API. I use it in https://github.com/dselivanov/text2vec and https://github.com/dselivanov/rsparse.
I believe that such a core pkg as mlr3 should:
provide only the interface other pkgs should follow. We have a zoo of pkgs implementing ML algorithms whose quality and interfaces vary a lot. I think it is obvious now that the approach taken by caret and mlr was not entirely correct: we can't wrap every useful pkg and re-create its API/interface.
So I believe that if a pkg aims to be the standard pkg for ML in R, there are essentially 2 choices:
Use the correct level of abstraction. Here I strongly believe we have to stick to matrices (dense and sparse, from the Matrix pkg), as is done in scikit-learn. On top of that we may implement "transformers" to construct design matrices from data.frames.
When calling rr$experiment() or rr$experiments() without an argument, we should either have a default (e.g. iter = 1) including a message, or print a custom error message.
library(mlr3)
task = mlr_tasks$get("iris")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("cv")
resampling$param_vals = list(folds = 3)
rr = resample(task, learner, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 1/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 2/3)' ...
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris (iteration 3/3)' ...
rr$experiment()
#> Error in assert_int(iter, lower = 1L, upper = nrow(self$data), coerce = TRUE): argument "iter" is missing, with no default
Created on 2018-12-18 by the reprex package (v0.2.1)
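The default-with-message variant could look like the following sketch; the function shape is hypothetical, not the real method signature:

```r
# Sketch: default `iter` to 1 with an informative message instead of
# failing with a missing-argument error (function shape is hypothetical).
experiment = function(n_iters, iter = NULL) {
  if (is.null(iter)) {
    message("No iteration specified, returning the experiment for iter = 1")
    iter = 1L
  }
  stopifnot(iter >= 1L, iter <= n_iters)
  iter
}

experiment(3L, 2L)  # returns 2
experiment(3L)      # messages and returns 1
```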
Currently, this is a single string, i.e. "response" or "prob". We need to encode that "prob" automatically includes "response".
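One way to encode the implication is to treat predict types as an ordered vector, so that requesting a type also provides everything before it. A sketch (the helper is hypothetical):

```r
# Sketch: treat predict types as an ordered vector, so requesting "prob"
# implies that "response" is provided as well (helper is hypothetical).
predict_types = c("response", "prob")

provided_types = function(requested) {
  i = match(requested, predict_types)
  if (is.na(i)) stop("unknown predict type: ", requested)
  predict_types[seq_len(i)]
}

provided_types("prob")
#> [1] "response" "prob"
```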
What I always disliked in mlr is that manually creating prediction objects was difficult.
Currently, it seems that we can only create prediction objects by passing a task.
Suppose I have a vector of predictions and a vector of true values. It would be great if I could use mlr3 to construct a prediction object just using this (+ possibly other information such as predict.type...).
Any thoughts?
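A task-free constructor could be as small as the following sketch; the name `make_prediction` and the fields are made up for illustration:

```r
# Sketch of constructing a prediction object directly from vectors,
# without a Task (constructor name and fields are made up).
make_prediction = function(truth, response, predict_type = "response") {
  stopifnot(length(truth) == length(response))
  structure(
    list(truth = truth, response = response, predict_type = predict_type),
    class = "Prediction"
  )
}

p = make_prediction(
  truth = factor(c("M", "R", "R")),
  response = factor(c("M", "M", "R"))
)
mean(p$truth == p$response)  # accuracy
#> [1] 0.6666667
```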
We really need this; pushing that issue back even further is going to hurt us.
How about futile.logger?
can get very lengthy
Would it make sense to have this consistent with checkmate?
is this a good idea? maybe it is.
this should probably be done in a vignette
.onAttach?
In my humble opinion the syntax is still a bit complicated, even apart from the custom R6 classes.
Compare the code:
data("mtcars", package = "datasets")
b = BackendDataTable$new(data = mtcars[, 1:3])
task = TaskRegr$new(id = "cars", b, target = "mpg")
with
data("mtcars", package = "datasets")
task = makeRegrTask(data = mtcars[, 1:3], target = "mpg")
The second one also seems more intuitive to me at the moment; how should I know that I need the "new" function to create a new task... ;)
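A thin convenience wrapper could hide the `$new()` calls while keeping the R6 classes underneath. A sketch, where the wrapper name `make_regr_task` is hypothetical and `as_data_backend()` / `TaskRegr` are the mlr3 names from the snippet above:

```r
# Sketch of a convenience wrapper hiding the R6 constructor chain;
# make_regr_task is a hypothetical name, the inner calls are mlr3's.
make_regr_task = function(id, data, target) {
  b = as_data_backend(data)
  TaskRegr$new(id = id, backend = b, target = target)
}

# Intended usage (requires mlr3):
# task = make_regr_task("cars", mtcars[, 1:3], target = "mpg")
```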
Can we discuss whether we want to use the same ones as in mlr or a different structure?
We will probably soon have > 20 issues and should group them.
Maybe it's worth taking a look at how other projects organize their labels.
They should be somewhat generic so we can use them across all repos.
This came up when I resampled pipelines.
Maybe this goes away when we make the learner interface a bit more minimal?
Or "col_info".
We also don't really need "col.types" anymore? That seems totally redundant then.
Unsure if we really need this, but we could also encapsulate the "score" via evaluate or callr. Then we also need to store the score_log.
Presumably easy to implement, but not very urgent.
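The core of the idea can be sketched in base R with condition handlers; evaluate or callr would do this more robustly (or in a separate process). Names here are hypothetical:

```r
# Base-R sketch of the encapsulation idea: run score() while collecting
# warnings/errors into a score_log (evaluate/callr would be more robust).
encapsulate_score = function(fun) {
  log = character()
  score = withCallingHandlers(
    tryCatch(fun(), error = function(e) {
      log <<- c(log, conditionMessage(e))
      NA_real_
    }),
    warning = function(w) {
      log <<- c(log, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(score = score, score_log = log)
}

encapsulate_score(function() { warning("deprecated measure"); 0.93 })
```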
Tibble dependency has been removed (tidyverse/purrr#577).
Task creation with a dataset containing an ordered factor variable fails for TaskClassif and TaskRegr:
df = data.frame(x = c(1, 2, 3), y = factor(c(1, 2, 3), ordered = TRUE), z = c("M", "R", "R"))
b = as_data_backend(df)
TaskClassif$new(id = "id", backend = b, target = "z")
throws:
Error in vapply(.x, .f, FUN.VALUE = .value, USE.NAMES = FALSE, ...) :
values must be length 1,
but FUN(X[[2]]) result is length 2
It works fine without the ordered factor variable.
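Until this is fixed, a workaround is to strip the ordering from ordered factors before creating the backend; the levels are preserved, only the ordering is dropped:

```r
# Workaround sketch: convert ordered factors to plain factors before
# calling as_data_backend().
df = data.frame(
  x = c(1, 2, 3),
  y = factor(c(1, 2, 3), ordered = TRUE),
  z = c("M", "R", "R")
)
ord = vapply(df, is.ordered, logical(1))
df[ord] = lapply(df[ord], function(col) factor(col, ordered = FALSE))
class(df$y)
#> [1] "factor"
```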
Currently we seem to say that we want to use the new "measures" package.
Do we really want that? During the last call we said we would rather use Metrics:
https://cran.r-project.org/web/packages/Metrics/
Note that on CRAN there is also this.
This matches better with mlrPipelines.
It would be nice if packages from pbdR were utilized in mlr for efficient scalability.