Better parallelism for TableVectorizer about skrub HOT 5 CLOSED

LeoGrin commented on September 26, 2024

Better parallelism for TableVectorizer

from skrub.

Comments (5)

GaelVaroquaux commented on September 26, 2024

The reason that the ColumnTransformer creates one job per transformer is that some transformers can be multivariate (for instance a PCA, or a feature selection).

I do see your point that parallel computing could be much improved by special casing a few encoders, such as the GapEncoder that must be parallelized.

The challenge in my eyes is: how to do this. One approach is to override the "_iter" method of the ColumnTransformer, in a way similar to (pseudo-code, won't run):

UNIVARIATE_TRANSFORMERS = (GapEncoder, MinHashEncoder)

...

    def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
        for name, trans, columns, weight in ColumnTransformer._iter(
            self, fitted=fitted, replace_strings=replace_strings, column_as_strings=column_as_strings
        ):
            if isinstance(trans, UNIVARIATE_TRANSFORMERS):
                # Univariate transformer: yield one entry (hence one job) per column.
                for column in columns:
                    yield (name, trans, (column,), weight)
            else:
                # Possibly multivariate transformer: keep all its columns together.
                yield (name, trans, columns, weight)

This will need to be very extensively tested, as we are going to be toying with internals (_iter is a private function, and we are clearly putting our fingers a bit deep inside scikit-learn's private code).
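To make the splitting idea concrete without touching scikit-learn internals, here is a minimal, self-contained sketch of that logic alone. It works on plain (name, transformer, columns) triples; `FakeGapEncoder`, `FakePCA` and `split_univariate` are hypothetical stand-ins made up for illustration, not skrub or scikit-learn APIs.

```python
class FakeGapEncoder:
    """Hypothetical stand-in for a univariate encoder (handles one column at a time)."""


class FakePCA:
    """Hypothetical stand-in for a multivariate transformer."""


UNIVARIATE_TRANSFORMERS = (FakeGapEncoder,)


def split_univariate(transformers, univariate_types=UNIVARIATE_TRANSFORMERS):
    """Yield one (name, transformer, columns) triple per unit of parallel work."""
    for name, trans, columns in transformers:
        if isinstance(trans, univariate_types):
            # Univariate: each column can be processed independently.
            for column in columns:
                yield (name, trans, (column,))
        else:
            # Multivariate: the columns must stay together.
            yield (name, trans, tuple(columns))


gap, pca = FakeGapEncoder(), FakePCA()
jobs = list(split_univariate([
    ("gap", gap, ["city", "employer"]),
    ("pca", pca, ["x1", "x2", "x3"]),
]))
for name, _, columns in jobs:
    print(name, columns)
```

In this toy input, two transformers become three units of work: one per column for the univariate encoder, and a single one for the multivariate transformer.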

LeoGrin commented on September 26, 2024

Thanks! Another solution which might be simpler: find all transformers with the n_jobs attribute, and set it manually. I'm wondering how simple it is to combine this with the ColumnTransformer's parallelism (it doesn't seem to work very well when I do it naively on the current TableVectorizer).
What do you think? Here's the pseudo-code I have in mind:

for (name, trans, columns) in self.transformers:
    if hasattr(trans, "n_jobs"):
        trans.n_jobs = len(columns)  # assuming we have a lot of cores; should be set better
self.n_jobs = ...  # override if necessary

GaelVaroquaux commented on September 26, 2024

> Thanks! Another solution which might be simpler: find all transformers with the n_jobs attribute, and set it manually.

I don't think that this would work terribly well: it creates nested parallelism, with barriers, and would probably lead to much starvation.

LeoGrin commented on September 26, 2024

If we want to avoid nested parallelism, something very simple that would still be an improvement is to set the TableVectorizer's n_jobs to 1, and pass the n_jobs argument to all transformers which have this attribute. This would be faster than what we have now (there are usually more columns of each type than transformers). And in the current situation, where high-cardinality transformers are much slower than the other transformers, it should be close to your solution in terms of performance. What do you think about implementing this right now, and possibly coming back to your solution later if the situation changes?
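A minimal sketch of this "push n_jobs down" idea, under stated assumptions: the classes `FakeMinHashEncoder` and `FakeScaler` and the helper `push_n_jobs_down` are hypothetical and only mimic transformers with and without an n_jobs parameter. The sketch leaves alone any n_jobs the user set explicitly, which is an assumption about the desired behavior.

```python
class FakeMinHashEncoder:
    """Hypothetical encoder that exposes an n_jobs parameter."""

    def __init__(self, n_jobs=None):
        self.n_jobs = n_jobs


class FakeScaler:
    """Hypothetical transformer without any n_jobs parameter."""


def push_n_jobs_down(transformers, n_jobs):
    """Hand n_jobs to transformers that accept it and have not set it.

    Returns the n_jobs the outer vectorizer should use (always 1 here),
    so that only one level of parallelism is active.
    """
    for _name, trans, _columns in transformers:
        # Skip transformers without n_jobs, and respect explicit user settings.
        if hasattr(trans, "n_jobs") and trans.n_jobs is None:
            trans.n_jobs = n_jobs
    return 1


minhash = FakeMinHashEncoder()
pinned = FakeMinHashEncoder(n_jobs=2)
scaler = FakeScaler()
outer_n_jobs = push_n_jobs_down(
    [("minhash", minhash, ["a", "b"]), ("pinned", pinned, ["c"]), ("num", scaler, ["d"])],
    n_jobs=8,
)
print(outer_n_jobs, minhash.n_jobs, pinned.n_jobs)  # prints: 1 8 2
```

With real scikit-learn estimators, one would more likely work on clones and use set_params(n_jobs=...) rather than mutating the user's objects in place.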

LeoGrin commented on September 26, 2024

Summarizing the meeting discussion, there are two possibilities:

1. If all transformers have an n_jobs attribute and their n_jobs is None, pass the n_jobs argument to the transformers and set TableVectorizer.n_jobs to 1. Pro: benefits from the parallelism of some encoders (e.g. MinHashEncoder). Con: surprising for the user.
2. Override the "_iter" method. Pro: less surprising, more jobs being created. Con: dependence on a scikit-learn private function.

The second method was chosen.
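To illustrate why creating more jobs can help, here is a toy scheduling model, not a benchmark. It assumes hypothetical per-column costs, assumes a transformer's cost is the sum of its columns' costs, and assigns jobs greedily (longest first) to the least loaded worker. The `makespan` helper and all numbers are made up for illustration.

```python
import heapq


def makespan(job_costs, n_workers):
    """Finish time when jobs go, longest first, to the least loaded worker."""
    workers = [0.0] * n_workers
    heapq.heapify(workers)
    for cost in sorted(job_costs, reverse=True):
        least_loaded = heapq.heappop(workers)
        heapq.heappush(workers, least_loaded + cost)
    return max(workers)


# Hypothetical costs in seconds: 4 slow high-cardinality columns, 4 cheap numeric ones.
high_card_columns = [10.0, 10.0, 10.0, 10.0]
numeric_columns = [1.0, 1.0, 1.0, 1.0]

# One job per transformer: each transformer handles all of its columns in one job.
per_transformer = makespan(
    [sum(high_card_columns), sum(numeric_columns)], n_workers=4
)
# One job per column: possible only when the transformers are univariate.
per_column = makespan(high_card_columns + numeric_columns, n_workers=4)

print(per_transformer, per_column)  # prints: 40.0 11.0
```

In this model, with one job per transformer, two of the four workers sit idle while the slow encoder runs alone. With one job per column the slow columns are spread over all workers.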
