Maybe to add on to this:
One of the current weaknesses of mlr is, IMHO, that we cannot really use it for purposes like text mining, where we would require sparse matrices (bag-of-words etc.).
I think we really should find a solution for this within mlr3.
> However, there are some (important) algorithms which can deal with features being either numeric or factors. This is true for all tree-based methods.

That's what I meant by "almost all" :-)
Hi Dmitriy!
It is kinda funny that you are showing up here :) And I am glad, as that saves me an email that I should have written. When we had a meeting about mlr3's design some weeks back, we googled again for ML packages in R and saw mlapi. I had the feeling that the goals we (you and us) had were quite similar. I wanted to email you at that point to ask whether you wanted to join the discussion here, or maybe even contribute.
Also: in a very short time we would like to announce mlr3 on R-bloggers so that other people can step in or criticize.
> First of all, I'm very glad to see that mlr converged on R6. IMHO there is no need to reinvent the wheel; in the R community we just need to leverage the design of the super-successful scikit-learn.
We do not want to reinvent the wheel, quite the opposite. OTOH we don't want to "blindly" copy over design decisions from other projects.
In summary: we are looking at a project like sklearn very closely (I also talk to @amueller quite often). If there are good design decisions we should be more aware of, in your opinion, please state them at any point in time. But I also think that if sklearn could start fresh, they might do some things differently.
Anyway: let's discuss.
> provide only interface
> approach by caret and mlr was not entirely correct. We can't wrap every useful pkg and re-create its API/interface.

Can you explain in a bit more detail where the mistake has been, in your opinion?

> implement everything internally as it's done in scikit-learn

I am 100% certain that this is not feasible, or even reasonable.

> provide interface and some utils to help reduce boilerplate coding and let other developers follow it

We want this. OTOH we might provide the standard boilerplate code for some stuff that we like and want to maintain ourselves, as without this we would consider mlr3 less useful.
I want to add more on that topic later, it is important.
Also note that this does not only concern learning algorithms: there are objects like measures, preprocessing steps, and plots for which the same type of discussion can be had.
> Use correct level of abstraction. Here I strongly believe we have to stick to matrices (dense and sparse from the Matrix pkg), as it's done in scikit-learn. On top of that we may implement "transformers" to construct design matrices from data.frames.

I am really REALLY unsure whether that is correct, and it goes against what we currently have. Please argue very carefully and in detail here if you really believe that.
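For concreteness, the "transformer" step described above could look roughly like this. This is a minimal sketch using `sparse.model.matrix()` from the Matrix package (which ships with R); nothing here is actual mlr3 API:

```r
# Sketch: a "transformer" turning a data.frame into a sparse design matrix.
library(Matrix)

df <- data.frame(
  x1 = c(1.5, 2.0, 0.0),
  x2 = factor(c("a", "b", "a"))
)

# sparse.model.matrix() dummy-encodes the factor and returns a dgCMatrix;
# with no intercept the factor expands to one column per level: x1, x2a, x2b
X <- sparse.model.matrix(~ . - 1, data = df)
print(class(X))
print(dim(X))
```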
My personal view (and not the view of the sklearn community as a whole) is that users would be better off if sklearn estimators supported dataframes more natively.
Max Kuhn is also working on a new ML API, kind of a successor of caret, that might be interesting to look at.
https://github.com/topepo/parsnip
Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.
There are many ways how to encode categorical variables, impute missing values, etc. One example from @pfistfl above - text data. You need to encode it as sparse matrix. And then pass it to some downstream solver which works with matrices.
This is part of modeling and should be reproducible, i.e. easy to apply to new data. The more details are hidden, the more confusion is possible.
If there is demand for an API for casual users, this can be built on top. In the end we all need to keep in mind that packages are built not by casual users. The success of such a meta-package depends on the level of adoption across the developer community.
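To make the text-data example concrete: a bag-of-words document-term matrix is naturally sparse. Below is a minimal sketch using only the Matrix package; a real pipeline would use a dedicated package such as text2vec:

```r
# Build a tiny bag-of-words document-term matrix as a sparse dgCMatrix.
library(Matrix)

docs <- c("sparse matrices are useful", "sparse data needs sparse tools")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

i <- rep(seq_along(tokens), lengths(tokens))  # document index of each token
j <- match(unlist(tokens), vocab)             # vocabulary index of each token
# duplicate (i, j) pairs are summed by sparseMatrix(), giving term counts
dtm <- sparseMatrix(i = i, j = j, x = 1,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))
print(dtm)
```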
> Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.

I agree. However, there are some (important) algorithms which can deal with features being either numeric or factors. This is true for all tree-based methods. Converting to a sparse matrix is feasible here via dummy encoding, but would likely have a negative impact on both prediction performance and computational performance. So I strongly object to using matrices as the default format to store data.
I'm not saying that we do not want support for sparse data. This is why the data backend is exchangeable. It took me ~15 mins to prototype a backend for sparse data here: https://github.com/mlr-org/mlr3/tree/sparse_backend
This is not well tested and not completely integrated yet. However, in the future you can tag learning algorithms with "sparse" to automatically handle format conversions:
- If the backend is sparse and the learner is capable of working on sparse data: no conversion
- If the backend is a data.table and the learner needs sparse data: convert data.table -> sparse; we need something simple here. More advanced stuff could be done in a pipeline.
- If the backend is sparse and the learner needs a data.table: convert sparse -> data.table
- [...]
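The dispatch sketched in the bullets above could look like this. Everything here is purely illustrative: the function and flag names are hypothetical, and a plain data.frame stands in for the data.table backend to keep the sketch dependency-free:

```r
library(Matrix)

# Hypothetical dispatch between a backend's storage format and the
# format a learner declares it can consume (not mlr3 API).
convert_for_learner <- function(data, learner_supports_sparse) {
  is_sparse <- inherits(data, "sparseMatrix")
  if (is_sparse == learner_supports_sparse) {
    data  # formats already match: no conversion
  } else if (learner_supports_sparse) {
    # tabular -> sparse: simple dummy encoding; anything fancier
    # would live in a pipeline step
    sparse.model.matrix(~ . - 1, data = data)
  } else {
    # sparse -> tabular
    as.data.frame(as.matrix(data))
  }
}

dt <- data.frame(x = c(1, 0, 3), y = c(0, 2, 0))
sp <- convert_for_learner(dt, learner_supports_sparse = TRUE)
print(class(sp))   # a dgCMatrix
back <- convert_for_learner(sp, learner_supports_sparse = FALSE)
print(class(back)) # a data.frame again
```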
dt -> sparse: https://www.rdocumentation.org/packages/mltools/versions/0.3.5/topics/sparsify
We sort of need all preprocessing and pipeline steps to also support this.
> That's what I meant by "almost all" :-)
These are really important, though! And scikit-learn has a bunch of issues because this is now basically impossible by design. Also, at least in python, there are no column names in matrices, and in most cases it's helpful to have semantic column names.
> Most of the ML algorithms work with matrices. So the transformation step data.frame->matrix is almost always mandatory. And my point is that it should be explicit.
I'm all for explicit. But I don't think being explicit is in conflict with using data.frames. One-hot encoding, for example, discards which dummies encode the same categorical variable, and that makes it basically impossible to do semantically meaningful feature selection afterwards.
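This is easy to demonstrate in base R: after `model.matrix()`, the only link between dummy columns and their originating factor is the "assign" attribute, and ordinary column subsetting silently drops it:

```r
# One-hot encoding a factor: the dummy columns reveal their shared
# origin only through the "assign" attribute of the model matrix.
df <- data.frame(color = factor(c("red", "green", "blue")),
                 size = c(1, 2, 3))
mm <- model.matrix(~ . - 1, data = df)
print(colnames(mm))        # "colorblue" "colorgreen" "colorred" "size"
print(attr(mm, "assign"))  # 1 1 1 2: original variable of each column

# Subset the columns (e.g. after feature selection) and the mapping is lost:
sel <- mm[, c("colorred", "size")]
print(attr(sel, "assign")) # NULL -- no way to group the dummies anymore
```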
I haven't taken a closer look at the mentioned sparse data structure, but a while ago I made one, built on top of data.table, simply by storing (sparse) multidimensional data as tables modelled in a star schema. Access is very fast because the table is always sorted. https://gitlab.com/jangorecki/data.cube
Regarding data.table-matrix conversion, I am pretty sure there is space for improvement here. So far both methods are implemented in R; pushing them down to C shouldn't be very difficult, and using parallelism (by column) should also be possible.
Do you plan to support sparse data end-to-end? If it involves a sparse -> dense conversion, internally or explicitly outside, that means a memory explosion. So I prefer native support of sparse matrices end-to-end. Scikit-learn supports sparse DataFrames; how about you also support a sparse data.table?
> Do you plan to support sparse data end-to-end? If it involves a sparse -> dense conversion, internally or explicitly outside, that means a memory explosion. So I prefer native support of sparse matrices end-to-end.
If the backend stores the data in a sparse format, and the model accepts sparse data, there will be no internal sparse->dense conversion.
However, there is no standard sparse format in R. E.g., while most models support the Matrix sparse format, xgboost implements its very own format, so a conversion is unavoidable here.
We could implement a backend especially for xgboost, but then you would also need to create a task just for xgboost.
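For illustration, the unavoidable conversion mentioned above looks like this (a sketch; the xgboost call is guarded in case the package is not installed):

```r
library(Matrix)

# A small sparse feature matrix in the standard Matrix (dgCMatrix) format.
X <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 1), x = c(1, 2, 3),
                  dims = c(3, 2))
y <- c(0, 1, 0)

if (requireNamespace("xgboost", quietly = TRUE)) {
  # xgb.train() expects an xgb.DMatrix: the data is copied into
  # xgboost's own internal representation.
  dtrain <- xgboost::xgb.DMatrix(data = X, label = y)
  print(class(dtrain))
}
```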
> Scikit-learn supports sparse DataFrames; how about you also support a sparse data.table?

I'm not sure what you have in mind. Do you mean the suggested data.cube?
If there is a standard sparse matrix class, then it would be best to provide fast C conversion between data.table or matrix and such a class. Another thing is that such a sparse class (a package) should be lightweight, and be committed to staying lightweight.
Well, it would be nice to have a sparse data format which can also store categorical features natively. If there were an inexpensive conversion to dgCMatrix (e.g., after one-hot encoding the categorical features), we could do this on demand and it would be less code to maintain in mlr3.