Maybe to add on to this:
One of the current weaknesses of mlr is, IMHO, that we cannot really use it for purposes like text mining, where we would require sparse matrices (bag-of-words etc.).
I think we really should find a solution for this within mlr3.
> However, there are some (important) algorithms which can deal with features being either numeric or factors. This is true for all tree-based methods.

That's what I meant by "almost all" :-)
Hi Dmitriy!
It is kinda funny that you are showing up here :) And I am glad, as that saves me an email that I should have written. When we had a meeting about mlr3's design some weeks back, we googled again for ML packages in R and saw mlapi. I had the feeling that the goals we (you and us) had were quite similar. I wanted to email you at that point to ask whether you wanted to join the discussion here, or maybe even contribute.
Also: in a very short time we would like to announce mlr3 on R-bloggers so that other people can step in or criticize.
> First of all, I'm very glad to see that mlr converged on R6. IMHO there is no need to reinvent the wheel; in the R community we just need to leverage the design of the super-successful scikit-learn.
We do not want to reinvent the wheel, quite the opposite. OTOH we don't want to "blindly" copy over design decisions from other projects.
In summary: we are looking at a project like sklearn very closely (I also talk to @amueller quite often). If there are good design decisions we should be more aware of, in your opinion, please state them at any point in time. But I also think that if sklearn could start fresh, they might do some things differently.
Anyway: let's discuss.
> provide only interface
> approach by caret and mlr was not entirely correct. We can't wrap every useful pkg and re-create its API/interface.

Can you explain in a bit more detail where the mistake has been, in your opinion?

> implement everything internally as it's done in scikit-learn

I am 100% certain that this is not feasible, or even reasonable.

> provide interface and some utils to help reduce boilerplate coding and let other developers follow it

We want this. OTOH we might provide the standard boilerplate code for some stuff that we like and want to maintain ourselves, as without this we would consider mlr3 less useful.
I want to add more on that topic later, it is important.
Also note that this does not only concern learning algorithms: there are objects like measures, preprocessing steps, and plots for which the same type of discussion can be had.
> Use correct level of abstraction. Here I strongly believe we have to stick to matrices (dense and sparse from the Matrix pkg), as it's done in scikit-learn. On top of that we may implement "transformers" to construct design matrices from data.frames.

I am really REALLY unsure whether that is correct, and it goes against what we currently have. Please argue very carefully and in detail here if you really believe that.
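For concreteness, the "transformer" step described above could look roughly like this. This is a minimal sketch using `sparse.model.matrix()` from the Matrix package (which ships with R); nothing here is actual mlr3 API:

```r
# Sketch: a "transformer" turning a data.frame into a sparse design matrix.
library(Matrix)

df <- data.frame(
  x1 = c(1.5, 2.0, 0.0),
  x2 = factor(c("a", "b", "a"))
)

# sparse.model.matrix() dummy-encodes the factor and returns a dgCMatrix;
# with no intercept the factor expands to one column per level: x1, x2a, x2b
X <- sparse.model.matrix(~ . - 1, data = df)
print(class(X))
print(dim(X))
```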
My personal view (and not the view of the sklearn community as a whole) is that users would be better off if sklearn estimators supported dataframes more natively.
Max Kuhn is also working on a new ML API, kind of a successor of caret, that might be interesting to look at.
https://github.com/topepo/parsnip
Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.
There are many ways how to encode categorical variables, impute missing values, etc. One example from @pfistfl above - text data. You need to encode it as sparse matrix. And then pass it to some downstream solver which works with matrices.
This is part of modeling and should be reproducible, i.e. easy to apply to new data. The more details are hidden, the more confusion is possible.
If there is demand for an API for casual users, this can be built on top. In the end we all need to keep in mind that packages are built not by casual users. The success of such a meta-package depends on the level of adoption across the developer community.
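To make the text-data example concrete: a bag-of-words document-term matrix is naturally sparse. Below is a minimal sketch using only the Matrix package; a real pipeline would use a dedicated package such as text2vec:

```r
# Build a tiny bag-of-words document-term matrix as a sparse dgCMatrix.
library(Matrix)

docs <- c("sparse matrices are useful", "sparse data needs sparse tools")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

i <- rep(seq_along(tokens), lengths(tokens))  # document index of each token
j <- match(unlist(tokens), vocab)             # vocabulary index of each token
# duplicate (i, j) pairs are summed by sparseMatrix(), giving term counts
dtm <- sparseMatrix(i = i, j = j, x = 1,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))
print(dtm)
```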
> Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.

I agree. However, there are some (important) algorithms which can deal with features being either numeric or factors. This is true for all tree-based methods. Converting to a sparse matrix is feasible here via dummy encoding, but would likely have a negative impact on both prediction performance and computational performance. So I strongly object to using matrices as the default format to store data.
I'm not saying that we do not want support for sparse data. This is why the data backend is exchangeable. It took me ~15 mins to prototype a backend for sparse data here: https://github.com/mlr-org/mlr3/tree/sparse_backend
This is not well tested and not completely integrated yet. However, in the future you can tag learning algorithms with "sparse" to automatically handle format conversions:
- If the backend is sparse and the learner is capable of working on sparse data: no conversion
- If the backend is a data.table and the learner needs sparse data: convert data.table -> sparse; we need something simple here. More advanced stuff could be done in a pipeline.
- If the backend is sparse and the learner needs a data.table: convert sparse -> data.table
- [...]
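The dispatch sketched in the bullets above could look like this. Everything here is purely illustrative: the function and flag names are hypothetical, and a plain data.frame stands in for the data.table backend to keep the sketch dependency-free:

```r
library(Matrix)

# Hypothetical dispatch between a backend's storage format and the
# format a learner declares it can consume (not mlr3 API).
convert_for_learner <- function(data, learner_supports_sparse) {
  is_sparse <- inherits(data, "sparseMatrix")
  if (is_sparse == learner_supports_sparse) {
    data  # formats already match: no conversion
  } else if (learner_supports_sparse) {
    # tabular -> sparse: simple dummy encoding; anything fancier
    # would live in a pipeline step
    sparse.model.matrix(~ . - 1, data = data)
  } else {
    # sparse -> tabular
    as.data.frame(as.matrix(data))
  }
}

dt <- data.frame(x = c(1, 0, 3), y = c(0, 2, 0))
sp <- convert_for_learner(dt, learner_supports_sparse = TRUE)
print(class(sp))   # a dgCMatrix
back <- convert_for_learner(sp, learner_supports_sparse = FALSE)
print(class(back)) # a data.frame again
```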
dt -> sparse: https://www.rdocumentation.org/packages/mltools/versions/0.3.5/topics/sparsify
We sort of need all preprocessing and pipeline steps to also support this.
> That's what I meant by "almost all" :-)
These are really important, though! And scikit-learn has a bunch of issues because this is now basically impossible by design. Also, at least in python, there are no column names in matrices, and in most cases it's helpful to have semantic column names.
> Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.
I'm all for explicit. But I don't think being explicit is in conflict with using data.frames. One-hot encoding, for example, discards which dummies encode the same categorical variable, and that makes it basically impossible to do semantically meaningful feature selection afterwards.
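This is easy to demonstrate in base R: after `model.matrix()`, the only link between dummy columns and their originating factor is the "assign" attribute, and ordinary column subsetting silently drops it:

```r
# One-hot encoding a factor: the dummy columns reveal their shared
# origin only through the "assign" attribute of the model matrix.
df <- data.frame(color = factor(c("red", "green", "blue")),
                 size = c(1, 2, 3))
mm <- model.matrix(~ . - 1, data = df)
print(colnames(mm))        # "colorblue" "colorgreen" "colorred" "size"
print(attr(mm, "assign"))  # 1 1 1 2: original variable of each column

# Subset the columns (e.g. after feature selection) and the mapping is lost:
sel <- mm[, c("colorred", "size")]
print(attr(sel, "assign")) # NULL -- no way to group the dummies anymore
```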
I haven't taken a closer look at the mentioned sparse data structure, but a while ago I made one, built on top of data.table, simply by storing (sparse) multidimensional data as tables modelled in a star schema. Access is very fast because the table is always sorted. https://gitlab.com/jangorecki/data.cube
Regarding data.table-matrix conversion, I am pretty sure there is space for improvement here. So far both methods are implemented in R; pushing them down to C shouldn't be very difficult, and using parallelism (by column) should also be possible.
Do you plan to support sparse data end-to-end? If it involves a sparse -> dense conversion, internally or explicitly outside, that means a memory explosion. So I prefer native support of sparse matrices end-to-end. Scikit-learn supports sparse DataFrames; how about you also support a sparse data.table?
> Do you plan to support sparse data end-to-end? If it involves a sparse -> dense conversion, internally or explicitly outside, that means a memory explosion. So I prefer native support of sparse matrices end-to-end.
If the backend stores the data in a sparse format, and the model accepts sparse data, there will be no internal sparse->dense conversion.
However, there is no standard sparse format in R. E.g., while most models support the Matrix sparse format, xgboost implements its very own format, so a conversion is unavoidable here.
We could implement a backend especially for xgboost, but then you would also need to create a task just for xgboost.
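For illustration, the unavoidable conversion mentioned above looks like this (a sketch; the xgboost call is guarded in case the package is not installed):

```r
library(Matrix)

# A small sparse feature matrix in the standard Matrix (dgCMatrix) format.
X <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 1), x = c(1, 2, 3),
                  dims = c(3, 2))
y <- c(0, 1, 0)

if (requireNamespace("xgboost", quietly = TRUE)) {
  # xgb.train() expects an xgb.DMatrix: the data is copied into
  # xgboost's own internal representation.
  dtrain <- xgboost::xgb.DMatrix(data = X, label = y)
  print(class(dtrain))
}
```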
> Scikit-learn supports sparse DataFrames; how about you also support a sparse data.table?

I'm not sure what you have in mind. Do you mean the suggested data.cube?
If there is a standard sparse matrix class, then it would be best to provide fast C conversion between data.table or matrix and such a class. Another thing is that such a sparse class (a package) should be lightweight, and be committed to staying lightweight.
Well, it would be nice to have a sparse data format which can also store categorical features natively. If there were an inexpensive conversion to dgCMatrix (e.g., after one-hot encoding the categorical features), we could do this on demand and it would be less code to maintain in mlr3.