
skada's Introduction

SKADA - Domain Adaptation with scikit-learn and PyTorch


Warning

This library is currently in a phase of active development. All features are subject to change without prior notice. If you are interested in collaborating, please feel free to reach out by opening an issue or starting a discussion.

SKADA is a library for domain adaptation (DA) with a scikit-learn and PyTorch/skorch compatible API with the following features:

  • DA estimators and transformers with a scikit-learn compatible API (fit, transform, predict).
  • PyTorch/skorch API for deep learning DA algorithms.
  • Classifier/Regressor and data Adapter DA algorithms compatible with scikit-learn pipelines.
  • Compatible with scikit-learn validation loops (cross_val_score, GridSearchCV, etc.).

Citation: If you use this library in your research, please cite the following reference:

Gnassounou T., Kachaiev O., Flamary R., Collas A., Lalou Y., de Mathelin A., Gramfort A., Bueno R., Michel F., Mellot A.,  Loison V., Odonnat A., Moreau T. (2024). SKADA : Scikit Adaptation (version 0.3.0). URL: https://scikit-adaptation.github.io/

or in BibTeX format:

@misc{gnassounou2024skada,
author = {Gnassounou, Théo and Kachaiev, Oleksii and Flamary, Rémi and Collas, Antoine and Lalou, Yanis and de Mathelin, Antoine and Gramfort, Alexandre and Bueno, Ruben and Michel, Florent and Mellot, Apolline and  Loison, Virginie and Odonnat, Ambroise and Moreau, Thomas},
month = {7},
title = {SKADA : Scikit Adaptation},
url = {https://scikit-adaptation.github.io/},
year = {2024}
}

Implemented algorithms

The following algorithms are currently implemented.

Domain adaptation algorithms

  • Sample reweighting methods (Gaussian [1], Discriminant [2], KLIEPReweight [3], DensityRatio [4], TarS [21], KMMReweight [25])
  • Sample mapping methods (CORAL [5], Optimal Transport DA OTDA [6], LinearMonge [7], LS-ConS [21])
  • Subspace methods (SubspaceAlignment [8], TCA [9], Transfer Subspace Learning [27])
  • Other methods (JDOT [10], DASVM [11], OT Label Propagation [28])

Any method that can be cast as an adaptation of the input data can be used in one of two ways:

  • as a full Classifier/Regressor DA estimator,
  • or as an Adapter (a scikit-learn transformer) inside a DA pipeline built with make_da_pipeline.

Refer to the examples below and visit the gallery for more details.

Deep learning domain adaptation algorithms

  • Deep Correlation alignment (DeepCORAL [12])
  • Deep joint distribution optimal (DeepJDOT [13])
  • Divergence minimization (MMD/DAN [14])
  • Adversarial/discriminator based DA (DANN [15], CDAN [16])

DA metrics

  • Importance Weighted [17]
  • Prediction entropy [18]
  • Soft neighborhood density [19]
  • Deep Embedded Validation (DEV) [20]
  • Circular Validation [11]

Installation

The library is not yet available on PyPI. You can install it from source:

pip install git+https://github.com/scikit-adaptation/skada

Short examples

We provide here a few examples to illustrate the use of the library. For more details, please refer to this example, the quick start guide and the gallery.

First, the DA data in the SKADA API is stored in the following format:

X, y, sample_domain 

where X is the input data, y contains the target labels, and sample_domain contains the domain labels (positive values for source domains, negative values for target domains).
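For instance, here is a minimal sketch of this format on made-up toy data (values are purely illustrative):

import numpy as np

# three source samples (domain label +1) and two target samples (domain label -2)
X = np.array([[0.0], [1.0], [2.0], [1.5], [2.5]])
y = np.array([0, 0, 1, 0, 1])
sample_domain = np.array([1, 1, 1, -2, -2])

We provide below an example of how to fit a DA estimator: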

from skada import CORAL

da = CORAL()
da.fit(X, y, sample_domain=sample_domain) # sample_domain passed by name

ypred = da.predict(Xt) # predict on test data

One can also use Adapter classes to create a full pipeline with DA:

from skada import CORALAdapter, make_da_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_da_pipeline(StandardScaler(), CORALAdapter(), LogisticRegression())

pipe.fit(X, y, sample_domain=sample_domain) # sample_domain passed by name

Please note that for Adapter classes that implement sample reweighting, the subsequent classifier/regressor must be configured to request sample_weight as input. This is done with the set_fit_request method. For instance, with LogisticRegression, you would use LogisticRegression().set_fit_request(sample_weight=True):

from sklearn.linear_model import LogisticRegression
from skada import GaussianReweightAdapter, make_da_pipeline

pipe = make_da_pipeline(GaussianReweightAdapter(),
                        LogisticRegression().set_fit_request(sample_weight=True))

Finally, SKADA can be used for cross-validation score estimation and hyperparameter selection:

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from skada import CORALAdapter, make_da_pipeline
from skada.model_selection import SourceTargetShuffleSplit
from skada.metrics import PredictionEntropyScorer

# make pipeline
pipe = make_da_pipeline(StandardScaler(), CORALAdapter(), LogisticRegression())

# split and score
cv = SourceTargetShuffleSplit()
scorer = PredictionEntropyScorer()

# cross val score
scores = cross_val_score(pipe, X, y, params={'sample_domain': sample_domain}, 
                         cv=cv, scoring=scorer)

# grid search
param_grid = {'coraladapter__reg': [0.1, 0.5, 0.9]}
grid_search = GridSearchCV(estimator=pipe,
                           param_grid=param_grid,
                           cv=cv, scoring=scorer)

grid_search.fit(X, y, sample_domain=sample_domain)
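As with any scikit-learn search, the selected hyperparameter and the refitted model can then be inspected (standard sklearn attributes, shown here for completeness):

# best regularization value found with the DA scorer, and the refitted pipeline
print(grid_search.best_params_)
print(grid_search.best_score_)
best_model = grid_search.best_estimator_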

Acknowledgements

This toolbox has been created and is maintained by the SKADA team (see the contributors list below).

License

The library is distributed under the 3-Clause BSD license.

References

[1] Shimodaira Hidetoshi. "Improving predictive inference under covariate shift by weighting the log-likelihood function." Journal of statistical planning and inference 90, no. 2 (2000): 227-244.

[2] Sugiyama Masashi, Taiji Suzuki, and Takafumi Kanamori. "Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation." Annals of the Institute of Statistical Mathematics 64 (2012): 1009-1044.

[3] Sugiyama Masashi, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul Von Bünau, and Motoaki Kawanabe. "Direct importance estimation for covariate shift adaptation." Annals of the Institute of Statistical Mathematics 60 (2008): 699-746.

[4] Sugiyama Masashi, and Klaus-Robert Müller. "Input-dependent estimation of generalization error under covariate shift." (2005): 249-279.

[5] Sun Baochen, Jiashi Feng, and Kate Saenko. "Correlation alignment for unsupervised domain adaptation." Domain adaptation in computer vision applications (2017): 153-171.

[6] Courty Nicolas, Flamary Rémi, Tuia Devis, and Alain Rakotomamonjy. "Optimal transport for domain adaptation." IEEE Trans. Pattern Anal. Mach. Intell 1, no. 1-40 (2016): 2.

[7] Flamary, R., Lounici, K., & Ferrari, A. (2019). Concentration bounds for linear monge mapping estimation and optimal transport domain adaptation. arXiv preprint arXiv:1905.10155.

[8] Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision (pp. 2960-2967).

[9] Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2010). Domain adaptation via transfer component analysis. IEEE transactions on neural networks, 22(2), 199-210.

[10] Courty, N., Flamary, R., Habrard, A., & Rakotomamonjy, A. (2017). Joint distribution optimal transportation for domain adaptation. Advances in neural information processing systems, 30.

[11] Bruzzone, L., & Marconcini, M. (2009). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE transactions on pattern analysis and machine intelligence, 32(5), 770-787.

[12] Sun, B., & Saenko, K. (2016). Deep coral: Correlation alignment for deep domain adaptation. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14 (pp. 443-450). Springer International Publishing.

[13] Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018). Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European conference on computer vision (ECCV) (pp. 447-463).

[14] Long, M., Cao, Y., Wang, J., & Jordan, M. (2015, June). Learning transferable features with deep adaptation networks. In International conference on machine learning (pp. 97-105). PMLR.

[15] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., ... & Lempitsky, V. (2016). Domain-adversarial training of neural networks. Journal of machine learning research, 17(59), 1-35.

[16] Long, M., Cao, Z., Wang, J., & Jordan, M. I. (2018). Conditional adversarial domain adaptation. Advances in neural information processing systems, 31.

[17] Sugiyama, M., Krauledat, M., & Müller, K. R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5).

[18] Morerio, P., Cavazza, J., & Murino, V. (2017). Minimal-entropy correlation alignment for unsupervised deep domain adaptation. arXiv preprint arXiv:1711.10288.

[19] Saito, K., Kim, D., Teterwak, P., Sclaroff, S., Darrell, T., & Saenko, K. (2021). Tune it the right way: Unsupervised validation of domain adaptation via soft neighborhood density. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9184-9193).

[20] You, K., Wang, X., Long, M., & Jordan, M. (2019, May). Towards accurate model selection in deep unsupervised domain adaptation. In International Conference on Machine Learning (pp. 7124-7133). PMLR.

[21] Zhang, K., Schölkopf, B., Muandet, K., Wang, Z. (2013). Domain Adaptation under Target and Conditional Shift. In International Conference on Machine Learning (pp. 819-827). PMLR.

[22] Loog, M. (2012). Nearest neighbor-based importance weighting. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE (https://arxiv.org/pdf/2102.02291.pdf)

[23] Bruzzone, L., & Marconcini, M. (2009). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5), 770-787. (https://rslab.disi.unitn.it/papers/R82-PAMI.pdf)

[24] Loog, M. (2012). Nearest neighbor-based importance weighting. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE (https://arxiv.org/pdf/2102.02291.pdf)

[25] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf and A. J. Smola. Correcting sample selection bias by unlabeled data. In NIPS, 2007. (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=07117994f0971b2fc2df95adb373c31c3d313442)

[26] Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. (2014). Transfer joint matching for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1410–1417

[27] Si, S., Tao, D., & Geng, B. (2010). Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering.

[28] Solomon, J., Rustamov, R., Guibas, L., & Butscher, A. (2014, January). Wasserstein propagation for semi-supervised learning. In International Conference on Machine Learning (pp. 306-314). PMLR.

[29] Montesuma, Eduardo Fernandes, and Fred Maurice Ngole Mboula. "Wasserstein barycenter for multi-source domain adaptation." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16785-16793. 2021.

[30] Gnassounou, Theo, Rémi Flamary, and Alexandre Gramfort. "Convolution Monge Mapping Normalization for learning on sleep data." Advances in Neural Information Processing Systems 36 (2024).

[31] Redko, Ievgen, Nicolas Courty, Rémi Flamary, and Devis Tuia. "Optimal transport for multi-source domain adaptation under target shift." In The 22nd International Conference on artificial intelligence and statistics, pp. 849-858. PMLR, 2019.

[32] Hu, D., Liang, J., Liew, J. H., Xue, C., Bai, S., & Wang, X. (2023). Mixed Samples as Probes for Unsupervised Model Selection in Domain Adaptation. Advances in Neural Information Processing Systems 36 (2024).

skada's People

Contributors

agramfort, ambroiseodt, antoinecollas, antoinedemathelin, apmellot, buenoruben, florent-michel, kachayev, rflamary, tgnassou, tommoral, vloison, yanislalou


skada's Issues

Update readme and add all references to solver and estimators

We should begin to list the implemented approaches and have a proper reference list. I like the way we did this in POT with a numbered list in the README and unique numbers in the documentation:
https://github.com/PythonOT/POT

We can start the readme with a small description of the toolbox and a short list/itemize of implemented methods grouped by theme (reweighting methods, mapping, deep divergence, others...). We should also provide a short example of how to use the methods (adapter/pipeline and DA estimator) and move some of the current details in the README into a quick start guide.

WDYT?

Accuracy of 0.00 in generated example plot_cross_val_score_for_da.html

Last doc build: https://output.circle-artifacts.com/output/job/d77ed2a4-3c2a-48e2-838d-75bc08bb00a7/artifacts/0/dev/auto_examples/validation/plot_cross_val_score_for_da.html#sphx-glr-auto-examples-validation-plot-cross-val-score-for-da-py

In this example we compare 2 estimators for a binary classification task.
The first estimator uses DA methods and the second one doesn't.

The 1st estimator has an accuracy of 0.93.
The 2nd estimator has an accuracy of 0.00.

There is clearly a mistake for the 2nd estimator knowing that we're doing a binary classification task.

Allow pack() functions to accept strings as input for as_sources and as_targets args

Currently, every DAdataset packing function is only compatible with List[str] for the as_sources and as_targets arguments. However, in cases where we have only one target and one source, it would be simpler to use a single string instead of a list of strings.
OR, the use of strings altogether could be prohibited.
Right now, when we provide a string instead of a list, the packing functions iterate over the string, letter by letter, eventually raising a KeyError exception, as sketched below.
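A minimal sketch of the failure mode (dataset and domain names are hypothetical):

# a bare string is iterated character by character, so each letter
# is looked up as a domain name: 's', 'o', 'u', ... -> KeyError
X, y, sample_domain = dataset.pack_train(as_sources='source', as_targets='target')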

Consider adding tabular dataset

For example, the one with personal/business flights I've been experimenting with.

It would be nice to have more than computer-vision datasets provided out of the box.

Weights normalization in ReweightDensity

Hi everyone,
I wonder why the weights are normalized in ReweightDensity, cf.

source_weights /= source_weights.sum()
?

After some tests, I observed that the strange behavior of ReweightDensity in the comparison example is due to the weight normalization. cf. https://scikit-adaptation.github.io/auto_examples/plot_method_comparison.html#sphx-glr-auto-examples-plot-method-comparison-py

Maybe we should remove the normalization, or at least divide by the mean instead of the sum (dividing by the sum makes the weights depend on the dataset size, which should be avoided, I think).
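A sketch of the suggested alternative (illustrative only):

# normalize by the mean so the average weight is 1, independently of dataset size
source_weights /= source_weights.mean()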

Cleanup `DomainAwareDataset` API

  • review methods and naming convention, make sure all methods are necessary
  • write docstrings
  • test cases, specifically to cover the functionality of adding new domains, getting domain and selecting multiple domains
  • test cases to cover failure modes, like overlapping names

Multidomain source target split and merge

Add the ability to have (see the sketch after this list):

  • source_target_merge from utils.py to accept *List[arrays] with each list corresponding to one sample domain. Ex: List[X_domain_1, y_domain_1], List[X_domain_2, y_domain_2] ...
  • source_target_split from utils.py to return *List[arrays] with each list corresponding to one sample domain.
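One possible reading of the proposed source_target_merge call (hypothetical, signature illustrative only):

# merge per-domain (X, y) pairs into the flat (X, y, sample_domain) format
X, y, sample_domain = source_target_merge(
    [X_domain_1, y_domain_1],
    [X_domain_2, y_domain_2],
)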

Implement Kernel Mean Matching

Hi everyone,
I am glad to see that the skada repo is already public. I really like the API choices that have been made (the pipeline idea is great!)
I am opening this issue to propose the implementation of the Kernel Mean Matching reweighting method (cf. the “Correcting sample selection bias by unlabeled data” paper [25]).

Are you ok to add it to the library? If yes, I can open a PR.

Bug when using a da_pipeline inside a da_pipeline (I know)

When running this simple code, which should work (basically using CORAL, which is itself a DA pipeline),

from skada import CORAL, make_da_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# create a DA pipeline
pipe = make_da_pipeline(StandardScaler(), CORAL(base_estimator=SVC()))
pipe.fit(X, y, sample_domain=sample_domain)

we have the following error

AttributeError                            Traceback (most recent call last)
[~/PYTHON/skada/examples/plot_how_to_use_skada.py](https://file+.vscode-resource.vscode-cdn.net/home/rflamary/PYTHON/skada/~/PYTHON/skada/examples/plot_how_to_use_skada.py) in <cell line: 15>()
     13 # create a DA pipeline
     14 pipe = make_da_pipeline(StandardScaler(),CORAL(base_estimator=SVC()))
---> 15 pipe.fit(X, y, sample_domain=sample_domain)
     16 
     17 print('Accuracy on target:',pipe.score(Xt,yt))

[~/.local/lib/python3.10/site-packages/sklearn/base.py](https://file+.vscode-resource.vscode-cdn.net/home/rflamary/PYTHON/skada/~/.local/lib/python3.10/site-packages/sklearn/base.py) in wrapper(estimator, *args, **kwargs)
   1192                 )
   1193             ):
-> 1194                 return fit_method(estimator, *args, **kwargs)
   1195 
   1196         return wrapper

[~/.local/lib/python3.10/site-packages/sklearn/pipeline.py](https://file+.vscode-resource.vscode-cdn.net/home/rflamary/PYTHON/skada/~/.local/lib/python3.10/site-packages/sklearn/pipeline.py) in fit(self, X, y, **params)
    467             Pipeline with fitted steps.
    468         """
--> 469         routed_params = self._check_method_params(method="fit", props=params)
    470         Xt = self._fit(X, y, routed_params)
    471         with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):

[~/.local/lib/python3.10/site-packages/sklearn/pipeline.py](https://file+.vscode-resource.vscode-cdn.net/home/rflamary/PYTHON/skada/~/.local/lib/python3.10/site-packages/sklearn/pipeline.py) in _check_method_params(self, method, props, **kwargs)
    353     def _check_method_params(self, method, props, **kwargs):
...
--> 165         request.fit.add_request(param='sample_domain', alias=True)
    166         request.transform.add_request(param='sample_domain', alias=True)
    167         request.predict.add_request(param='sample_domain', alias=True)

AttributeError: 'MetadataRouter' object has no attribute 'fit'

Interestingly, the code above works with a classical sklearn make_pipeline but not with ours.

Split helper functions for input validation and reshaping

As of now we have the following input validation helpers:

  • check_X_y_domain
  • check_X_domain

They are designed to be used similarly to sklearn functions like utils.check_X_y (link), with the advantage of understanding the sample_domain parameter (or inferring it when necessary).

Right now they provide multiple different output shapes, like the option to split sources vs. targets or to return only domain indices. What needs to be done is the following (see the sketch after this list):

  • check_X_y_domain always returns a tuple (X, y, sample_domain); check_X_domain always returns (X, sample_domain)
  • both are moved to a publicly available namespace utils instead of the internal _utils (to match their sklearn counterparts)
  • separately, new utils functions are introduced to cover the previously available functionality (splitting sources vs. targets, extracting indices, etc.)
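A sketch of the target behavior (function names from the issue, output shapes as proposed):

# fixed output shapes, regardless of the flags passed
X, y, sample_domain = check_X_y_domain(X, y, sample_domain)
X, sample_domain = check_X_domain(X, sample_domain)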

Don't forget about proper docstrings and test coverage.

Merge all OT mapping methods in one class

Should we merge OT mapping methods?

It seems like the use of entropic regularization or group lasso could easily be added as a parameter to the same class. We should probably keep LinearOTMapping separate because it is very different in terms of mapping, but OTMapping, EntropicOTmapping and Grouplasso could be merged into one class in my opinion, and it would make SKADA less OT biased ;).

Another question is whether we should move them into _ot.py. On this one I'm a little more circumspect, since those are mapping methods and they probably belong with CORAL...

Make sure all API methods accept `sample_domain` as `None`

In this case the method has to derive domain labels using y masking. If masking is not done, or is in some way ambiguous, we should throw a ValueError to report inconsistent inputs.

This is typically done with the allow_auto_sample_domain=True param for the check_*_domain utils. I'm not sure this is implemented consistently throughout all methods. Also, test cases to cover this are absolutely necessary.
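A sketch of what such a derivation might look like, assuming NaN-masked target labels as in the KLIEP example further down (the actual skada masking convention may differ):

import numpy as np

# hypothetical helper: treat NaN-labeled samples as targets (-1),
# everything else as sources (+1); fail if nothing is masked
def infer_sample_domain(y):
    target_mask = np.isnan(y)
    if not target_mask.any():
        raise ValueError("Cannot infer domains: no masked labels found.")
    return np.where(target_mask, -1, 1)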

Selector to propagate params to the estimators

Otherwise, we need to do something like the following:

grid_search = GridSearchCV(
    estimator,
    {"entropicotmappingadapter__base_estimator__reg_e": reg_e},
    cv=cv,
    scoring=PredictionEntropyScorer(),
)

Should be

grid_search = GridSearchCV(
    estimator,
    {"entropicotmappingadapter__reg_e": reg_e},
    cv=cv,
    scoring=PredictionEntropyScorer(),
)

instead. See examples.

Add test selector/filter for “slow” tests

A good example is the test suite for Office31, as it first downloads the dataset from an external source.

We need to mark those tests as "slow", so we can run pytest with the configuration to omit such tests from the run.
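A standard pytest pattern for this (a sketch; the marker name is a suggestion and would need to be registered in the pytest configuration):

import pytest

@pytest.mark.slow
def test_office31_download():
    ...

# deselect slow tests by default:
#   pytest -m "not slow"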

KLIEP pipeline does not use the sample_weight returned by KLIEPAdapter when fitting the estimator

Hi everyone,

I think there is an issue with KLIEP. It provides the same results as the estimator fitted without sample_weight. It seems that the sample_weight returned by KLIEPAdapter is not given as an argument to the fit method of the estimator. Here is a little example showing that KLIEP provides the same predictions as the estimator without weighting, while it should provide the same results as the estimator fitted with the weights returned by KLIEPAdapter:

import numpy as np
from sklearn.linear_model import LinearRegression
from skada import KLIEP, KLIEPAdapter
from skada.datasets import DomainAwareDataset

X_source = np.array([[-1.],
                     [0.],
                     [1.]])

X_target = np.array([[1.]])

y_source = np.abs(X_source.ravel())
y_target = np.ones(1)

dataset = DomainAwareDataset([(X_source, y_source, 's'), (X_target, y_target, 't')])
X, y, sample_domain = dataset.pack_train(as_sources=['s'], as_targets=['t'])

pipeline_model = KLIEP(base_estimator=LinearRegression(), gamma=1.)

kliep_adapt = KLIEPAdapter(gamma=1.)
kliep_estimator = LinearRegression()
noweighting_estimator = LinearRegression()

pipeline_model.fit(X, y, sample_domain=sample_domain)

kliep_adapt.fit(X, y, sample_domain=sample_domain)
weighting_outputs = kliep_adapt.transform(X, y, sample_domain=sample_domain, allow_source=True)

sample_weight = weighting_outputs["sample_weights"]

source_idx = ~np.isnan(y)
kliep_estimator.fit(X[source_idx], y[source_idx], sample_weight[source_idx])

noweighting_estimator.fit(X[source_idx], y[source_idx])

print("Pipeline Model preds:", pipeline_model.predict(X_target))
print("No Weighting + Estimator preds:", noweighting_estimator.predict(X_target))
print("KLIEP Weighting + Estimator preds:", kliep_estimator.predict(X_target))

>>> Pipeline Model preds: [0.66666667]
>>> No Weighting + Estimator preds: [0.66666667]
>>> KLIEP Weighting + Estimator preds: [0.96991182]

PS: I think the issue is not KLIEP-specific but is encountered by all reweighting classes.

Change target_labels name for the SupervisedScorer

In this example:

X, y, sample_domain = da_dataset.pack_train(as_sources=['s'], as_targets=['t'])
estimator = make_da_pipeline(
    ReweightDensityAdapter(),
    LogisticRegression().set_score_request(sample_weight=True),
)
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
_, target_labels, _ = da_dataset.pack(as_sources=['s'], as_targets=['t'], train=False)
scoring = SupervisedScorer()
scores = cross_validate(
    estimator,
    X,
    y,
    cv=cv,
    params={'sample_domain': sample_domain, 'target_labels': target_labels},
    scoring=scoring,
)

The target_labels array corresponds to the labels of the SOURCE + TARGET samples.
Calling this array target_labels is super confusing...

We could call it unmasked_labels, all_labels, source_target_labels or even y_test.

`Shared` selector should reject test time domain not seen during fitting

This functionality was planned... but was not implemented from what I can tell.

It seems like a good safety guard. It is also important for those algorithms that rely on test-time fitting (it would make sure that a dedicated API is used for this, avoiding any confusion).

The only question that might be slightly annoying: what if we fit the pipeline from a single source/single target but with non-default domain labels, say {-5, 7}? When running predict without a sample_domain input, this will fail (because we assume predict inputs to be targets with a default label). I guess it's okay; in a way, you created the problem for yourself. To be sure that what you put into predict is correct, it has to have domain labels attached to it.

Implement any multi-source and/or multi-target DA method

This is required for us to understand how the API "feels" when working with multi-source multi-target settings. It's also crucial for properly covering the functionality of selectors (note that as of now we only have basic Shared selector).

Can't use SupervisedScorer when using X, y, domain = dataset.pack_lodo

When using LeaveOneDomainOut we want to have a cross-validation that modifies the source and target domains at each split.
Thus we need to use the dataset.pack_lodo method, since it's the only one not asking for a source and a target domain.

Now if we want to use the SupervisedScorer, it requires a target_labels argument to generate a score.
However, when we use the pack_lodo function, we don't have the target_labels.

Thus we can't use SupervisedScorer when using LeaveOneDomainOut.
And by leaving the param target_labels empty in cross_validate, we get ValueErrors.

[BUG] ValueError raised when using the GaussianReweightDensityAdapter

Using this code snippet:

from skada import GaussianReweightDensityAdapter, make_da_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from skada.datasets import fetch_office31_surf_all

domain_dataset_office31 = fetch_office31_surf_all()

pipeline = make_da_pipeline(
    GaussianReweightDensityAdapter(),
)

X, y, sample_domain = domain_dataset_office31.pack_lodo()
pipeline.fit(X=X, y=y, sample_domain=sample_domain)

You'll get this error:
ValueError: Found array with 0 sample(s) (shape=(0, 800)) while a minimum of 1 is required by StandardScaler.

Also, it's worth noting that by adding a LogisticRegression at the end of the pipeline, the error magically disappears.

Rename dataset method `pack_flatten` to `pack_for_lodo`

See skada.datasets.DomainAwareDataset. The method is quite edgy; as of now, its sole purpose is to pack the dataset for a LeaveOneDomainOut splitter.

Rename it and write a docstring with a proper explanation of what this method is doing (and when it should be used).

Allow for (List[np.ndarray], List[np.ndarray], List[str]) as an init arg for DomainAwareDataset

Right now DomainAwareDataset accepts:

 def __init__(
        self,
        # xxx(okachaiev): not sure if dictionary is a good format :thinking:
        domains: Union[List[DomainDataType], Dict[str, DomainDataType], None] = None
    ):

with

DomainDataType = Union[
    # (name, X, y)
    Tuple[str, np.ndarray, np.ndarray],
    # (X, y)
    Tuple[np.ndarray, np.ndarray],
    # (X,)
    Tuple[np.ndarray, ],
]

I feel like it would be easier to also be able to init a DomainAwareDataset like that:

sample_domain = [1, 1, 1, 2, 2]
X = [0, 1, 0, 1, 2]
y = [0, 1, 0, 0, 1]

dataset = DomainAwareDataset(X, y, sample_domain)

Stacking multiple adapters in the pipeline results in confusing behavior

The reason for that is that many adapters are implemented in such a way that they learn the transformation from a source to a target domain, transform the source and pass it through the pipeline to fit the estimator (final step in the pipeline). Now, if we put multiple adapters in a row, every adapter after the first one won't see the target data.

A potential solution for this problem is to pass a special flag into the selector when we create the pipeline, marking whether a given transformation is "final" (i.e. it precedes the final estimator in the pipeline). Based on that, the selector can decide whether targets should be propagated forward.

make_da_pipeline does not remove target labels during estimator fit

Hi everyone,

When looking at the plot_label_comparison example, I observe very accurate predictions of the DA methods in cases where DA should fail. After some investigation, it seems that the estimator is fitted on both source and target labels, while it should be fitted only on the source. I made a little example with KLIEP highlighting this behavior:

import numpy as np
from sklearn.linear_model import LinearRegression
from skada import KLIEP

X_source = np.ones(10)
X_target = X_source

y_source = np.ones(10)
y_target = np.zeros(10)

X = np.concatenate((X_source, X_target))
y = np.concatenate((y_source, y_target))
sample_domain = np.ones(X.shape[0])
sample_domain[X_source.shape[0]:] *= -2

lr = LinearRegression()

kliep = KLIEP(base_estimator=lr)

kliep.fit(X.reshape(-1, 1), y, sample_domain=sample_domain)
kliep.predict(X.reshape(-1, 1))

>>> array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

Here, the source and target inputs are the same (X_source = X_target) while the labels are different (y_source = 1 and y_target = 0). If the target labels are unknown, KLIEP should predict 1, while it predicts 0.5 here. I suspect, then, that it has fitted the linear regressor on the target labels too.

When looking at the code of make_da_pipeline, I do not see where the target labels are removed before fitting the estimator.

Cleanup `check_*_domain` utils

The functions are already present in the code, so you can review them to understand their purposes. Given that they contain a significant number of internal clauses, it might be more straightforward to comprehend their functionality by reading the code directly.

  • check_X_y_domain [link]
  • check_X_domain [link]

The overall purpose of these functions is similar to the check_array function in sklearn. Depending on the provided flags, they can execute a wide range of checks. In our case, it’s crucial to ensure that all the checks are: a) necessary, meaning they make sense and are utilized in the code; b) implemented correctly, which we can verify by creating test cases for each check; and c) properly documented in the docstrings, with sklearn serving as an excellent reference for how to structure these. Additionally, we need to relocate these functions to a publicly accessible utils.py module, as they are intended for both internal use and external use (by anyone developing their own estimators).

Add splitter function `source_target_split`

We need a function available from the root of skada with the following use case:

X_s, y_s, X_t, y_t = source_target_split(X, y, sample_domain)

X_s, y_s, domain_s, X_t, y_t, domain_t = source_target_split(X, y, sample_domain, return_domains=True)

# it just uses the sign of sample_domain (or of y if sample_domain is not provided),
# but can still return the domains when asked (when multiple sources/targets are available)

For many visualizations we need an easy way to split the data that avoids using logical selection everywhere.

[API] fit, predict and other functions should accept X, y, sample_weight as first parameters by default

In order to stay compatible with the sklearn API, we want all estimators (and Adapters) to have functions that accept an unnamed sample_weight by default.

For the moment we have for Adapter and DAEstimator:

    @abstractmethod
    def fit(self, X, y=None, sample_domain=None, *, sample_weight=None):
        """Fit adaptation parameters"""

I suggest we change it to

    @abstractmethod
    def fit(self, X, y=None, sample_weight=None, sample_domain=None):
        """Fit adaptation parameters"""

which will allow calls like

   model.fit(X, y, sample_weight, sample_domain)

The named sample_domain will still be necessary for pipelines.

Polish `BaseSelector` API

There are quite a few things that should happen here:

  • consider converting BaseSelector into a meta-router (similar to how the pipeline works)
  • [DONE] base class need to be re-defined (better to start from implementation of other selectors so we understand what functionality we need)
  • [DONE] to implement missing methods, like decision_function and others
  • [DONE] avoid code duplication between methods

Better docstrings, test coverage as usual.
