
Comments (19)

jameschapman19 commented on June 18, 2024

Pushed to main just now (but not yet released as a new version)

from cca_zoo.

jameschapman19 commented on June 18, 2024

A heads-up that I made a significant change to main last night, which simplifies a few of the alternating optimisation methods like SCCA_PMD.

I had tried to link together the stochastic methods and iterative methods but the benefits didn't really materialize and the costs (simplicity/readability) were very high.

jameschapman19 commented on June 18, 2024

Ergh. The issue you raise is correct. I thought I'd managed to remove a lot of code around the CV by leaning on sklearn more, but it looks like I need to revisit this for the reason you raise.

I had a similar conversation with someone at OHBM yesterday about having deconfounding in a pipeline, and if I can handle your case I can also handle that case.

jameschapman19 commented on June 18, 2024

I should also add the context that scikit-learn removed the scale and center arguments from all their models in v1.0, and I essentially agree philosophically that it's not 'part of the model', which is why I removed it too. I just didn't notice this particular problem. So in general the answer is yes, I think the user should implement scaling/centering/preprocessing; it's just that the grid search currently uses an existing pipeline to get the splitting to work with sklearn.

jameschapman19 commented on June 18, 2024

So this would work as a minimal example:

from cca_zoo.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from cca_zoo.linear import rCCA

import numpy as np

X=np.random.normal(size=(100,10))
Y=np.random.normal(size=(100,10))

cca=rCCA()
pipe=Pipeline([
    ('scaler', StandardScaler()),
    ('cca', cca)
])
param_grid = {
    "c":[0.1,0.2,0.3],
}
gs= GridSearchCV(pipe, param_grid, cv=2, verbose=1, n_jobs=1)
gs.fit((X,Y))

JohannesWiesner commented on June 18, 2024

> I should also add the context that scikit-learn removed the scale and center arguments from all their models in v1.0, and I essentially agree philosophically that it's not 'part of the model', which is why I removed it too. I just didn't notice this particular problem. So in general the answer is yes, I think the user should implement scaling/centering/preprocessing; it's just that the grid search currently uses an existing pipeline to get the splitting to work with sklearn.

Yup, I also agree that it was a smart idea to drop the argument. The model shouldn't do more than it's supposed to do. Sounds like one needs to implement a custom StandardScaler that can scale both X and y (and possibly even more matrices, when we think of multi-view models). Guess it would be smart to handle this in a separate preprocessing module.

jameschapman19 commented on June 18, 2024

I will refer to the X,y in your example as X1, X2 for clarity, because the y argument is None from the code's perspective.

A preprocessing module was my initial thought, but it possibly overcomplicates the codebase given most use cases.

Gridsearch
GridSearchCV (and all the CV modules) would be happy accepting a scikit-learn StandardScaler as in your initial code on this thread, because behind the scenes we concatenate X1 and X2 as a first step; see here:

    def fit(self, X, y=None, *, groups=None, **fit_params):
        self.estimator = Pipeline(
            [
                ("splitter", SimpleSplitter([X_.shape[1] for X_ in X])),
                ("estimator", clone(self.estimator)),
            ]
        )
        super().fit(np.hstack(X), y=y, groups=groups, **fit_params)
        self.best_estimator_ = self.best_estimator_["estimator"]

So if you use a scaler in your model pipeline it will operate as if X1 and X2 were just a single matrix.

Normal model fitting
If the user is not doing CV, they can do their preprocessing outside of cca_zoo.
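As a rough illustration of preprocessing outside of cca_zoo, the sketch below standardises each view with its own scikit-learn StandardScaler before any model is fitted. The array shapes and variable names are assumptions made for the example; only scikit-learn and NumPy calls are used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X1 = rng.normal(loc=5.0, size=(100, 10))   # first view (illustrative data)
X2 = rng.normal(scale=3.0, size=(100, 8))  # second view (illustrative data)

# One scaler per view, so each view keeps its own means and variances.
scaler1, scaler2 = StandardScaler(), StandardScaler()
X1_scaled = scaler1.fit_transform(X1)
X2_scaled = scaler2.fit_transform(X2)

# The scaled views can then be handed to a model as a pair,
# e.g. (X1_scaled, X2_scaled).
```

Keeping the fitted scalers around means held-out data can later be transformed with the training statistics via `scaler1.transform(...)` and `scaler2.transform(...)`.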

Tentative conclusion

I think a preprocessing module is the elegant solution and might be justified within this package if either:

  • it was a common use case to apply different preprocessing to X1 and X2 as part of CV
  • there were some preprocessing methods that aren't in scikit-learn and are particularly common or useful in CCA studies (deconfounding would probably come under this bracket)

I think both are true to a degree but don't want to add code for the sake of it (I've made that mistake enough XD)

JohannesWiesner commented on June 18, 2024

Your provided example does not work. First of all, one has to provide param_grid as param_grid = {"cca__c":[0.1,0.2,0.3]} and not as param_grid = {"c":[0.1,0.2,0.3]}. But even if I do this, I get:

ValueError: Found array with dim 3. StandardScaler expected <= 2.
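On the parameter-naming point: scikit-learn pipelines address a step's parameter as `<step name>__<parameter>`. The sketch below shows this with plain scikit-learn estimators (Ridge is only a stand-in chosen for the example, not part of the thread's code); a bare parameter name is rejected by the pipeline.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# The prefixed name reaches the parameter of the step called "ridge".
pipe.set_params(ridge__alpha=0.5)
prefixed_ok = pipe.named_steps["ridge"].alpha == 0.5

# The bare name is not a parameter of the Pipeline itself.
try:
    pipe.set_params(alpha=0.5)
    bare_name_error = None
except ValueError as err:
    bare_name_error = str(err)
```

The same rule applies to the keys of a `param_grid` passed to a grid search over a pipeline.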

Also check this minimal example:

import numpy as np
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import rCCA
from cca_zoo.data.simulated import LinearSimulatedData
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

###############################################################################
## Settings ###################################################################
###############################################################################

n_jobs = 8
pre_dispatch = 3
rng = np.random.RandomState(43)

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

n = 100
p = 10
q = 100
latent_dims = 3
correlation = 0.9

data = LinearSimulatedData(
    view_features=[p, q],
    latent_dims=latent_dims,
    correlation=[0.9,0.8,0.7],
    structure='identity'
)

(X,y) = data.sample(n)

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 3

# define a search space (optimize left and right penalty parameters)
param_grid = {'cca__c':[list(np.arange(0.1,1.1,0.1)),0]}

# # define an estimator
estimator = Pipeline([
    ('scaler',StandardScaler()),
    ('cca',rCCA(latent_dimensions=latent_dimensions,random_state=rng))
    ])

##############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=n_jobs,cv=5)
grid.fit([X,y])

Here, I get this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 80) + inhomogeneous part.

jameschapman19 commented on June 18, 2024

Looking at this today - I agree with the observation.

jameschapman19 commented on June 18, 2024

Looks like scikit-learn is also a bit clunky in this sort of setup, e.g. the suggested solution here implies data leakage:

https://stackoverflow.com/questions/43366561/use-sklearns-gridsearchcv-with-a-pipeline-preprocessing-just-once

Just having a think about how we can deal with this.

jameschapman19 commented on June 18, 2024

Ah OK, I've got the hang of the problem now. I was wrong about being able to just apply the scaler in a pipe like this, because the operations run in this order:

  1. stack the views into a single X so scikit-learn is happy to do CV splitting on them
  2. use a 'Splitter' transformer to split X back into views
  3. apply user pipe
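The three steps above can be sketched with plain NumPy and scikit-learn. `ViewSplitter` here is a hypothetical stand-in written for this illustration, not the library's internal splitter; it only shows why a StandardScaler placed first in the user pipe receives a list of views rather than a 2D matrix.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class ViewSplitter(BaseEstimator, TransformerMixin):
    """Hypothetical splitter: cuts a stacked matrix back into its views."""

    def __init__(self, n_features_per_view):
        self.n_features_per_view = n_features_per_view

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Column indices at which one view ends and the next begins.
        boundaries = np.cumsum(self.n_features_per_view)[:-1]
        return np.split(X, boundaries, axis=1)


rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 10))
X2 = rng.normal(size=(100, 10))

stacked = np.hstack([X1, X2])                          # step 1: stack the views
views = ViewSplitter([10, 10]).fit_transform(stacked)  # step 2: split again

try:
    StandardScaler().fit(views)                        # step 3: user pipe
    scaler_error = None
except ValueError as err:
    # Two equally shaped views look like a 3D array to the scaler.
    scaler_error = str(err)
```

With two views of equal width the scaler sees a 3D input, which is consistent with the "Found array with dim 3" error reported earlier in the thread.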

I think then maybe the route is a transformer class where you can choose preprocessing for each view. I'll have a go and write something up here.

jameschapman19 commented on June 18, 2024

Cool so this is my proposed solution:

"""
Class which allows for the different (or the same) processing of multiple views of data.
"""
from mvlearn.utils import check_Xs
from sklearn.base import TransformerMixin, clone
from sklearn.utils.validation import check_is_fitted

class MultiViewPreprocessing(TransformerMixin):
    def __init__(self, preprocessing_list):
        self.preprocessing_list = preprocessing_list

    def fit(self, views, y=None):
        """
        Fits the associated preprocessing steps to each view.
        Parameters
        ----------
        views : list of arrays, one per view
        y : passed through to each preprocessing step's fit (may be None)

        Returns
        -------
        self : the fitted MultiViewPreprocessing instance
        """
        if len(self.preprocessing_list) == 1:
            # Clone so that each view gets its own transformer; repeating the
            # sequence would reuse one instance and the last fit would win.
            self.preprocessing_list = [clone(self.preprocessing_list[0]) for _ in views]
        elif len(self.preprocessing_list) != len(views):
            raise ValueError("Length of preprocessing_list must be 1 (apply the same preprocessing to each view) or equal to the number of views")
        check_Xs(views, enforce_views=range(len(self.preprocessing_list)))
        for view, preprocessing in zip(views, self.preprocessing_list):
            preprocessing.fit(view, y)
        return self

    def transform(self, X, y=None):
        """
        Transforms each view using the associated preprocessing steps.
        Parameters
        ----------
        X : list of arrays, one per view
        y : ignored

        Returns
        -------
        list of transformed views, in the same order as X
        """
        for preprocessing in self.preprocessing_list:
            check_is_fitted(preprocessing)
        check_Xs(X, enforce_views=range(len(self.preprocessing_list)))
        return [preprocessing.transform(view) for view, preprocessing in zip(X, self.preprocessing_list)]

Which works for me with the following script:

from cca_zoo.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from cca_zoo.preprocessing import MultiViewPreprocessing
from cca_zoo.linear import rCCA

import numpy as np

X=np.random.normal(size=(100,10))
Y=np.random.normal(size=(100,10))

rcca=rCCA()
pipe=Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(), StandardScaler()))),
    ('rcca', rcca)
])

param_grid = {
    "rcca__c":[0.1,0.2,0.3],
}
gs= GridSearchCV(pipe, param_grid, cv=2, verbose=1, n_jobs=1)
gs.fit((X,Y))

I think it would be fairly easy to use and works much like the scikit-learn API. What do you reckon? :)

jameschapman19 commented on June 18, 2024

I've chucked it in a new module, cca_zoo.preprocessing.

I think this would also allow deconfounding for each view by passing a deconfounding preprocessing step to MultiViewPreprocessing, which is something people have spoken to me about.

JohannesWiesner commented on June 18, 2024

Interesting! Will try this out :) If you push to the dev branch, I'd also be happy to test it :)

JohannesWiesner commented on June 18, 2024

Would be nice to do:

pipe=Pipeline([
    ('preprocessing', MultiViewPreprocessing((None, StandardScaler()))),
])

or

pipe=Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(),None))),
])

to simply apply nothing to a particular view. I would need that for my current analysis. My brain-variables are already scaled but my behavioral variables are not. Would like to apply the scaling only to the behavioral variables.
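One way such a `None` entry could be handled is to swap it for an identity transformer. The sketch below uses scikit-learn's FunctionTransformer, which leaves its input unchanged when no function is given. The helper `fit_transform_per_view` and the variable names are hypothetical and written only for this illustration; this is not necessarily how the package implements it.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def fit_transform_per_view(views, preprocessing_list):
    """Hypothetical helper: one transformer per view, None means 'leave as is'."""
    steps = [
        FunctionTransformer() if step is None else step
        for step in preprocessing_list
    ]
    return [step.fit_transform(view) for view, step in zip(views, steps)]


rng = np.random.default_rng(0)
brain = rng.normal(size=(50, 5))                          # already scaled
behaviour = rng.normal(loc=3.0, scale=2.0, size=(50, 4))  # not yet scaled

brain_out, behaviour_out = fit_transform_per_view(
    [brain, behaviour], [None, StandardScaler()]
)
```

Here only the behavioural view is standardised; the brain view passes through untouched.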

jameschapman19 commented on June 18, 2024

Great feedback!

jameschapman19 commented on June 18, 2024

Pushed new version to main which allows this

jameschapman19 commented on June 18, 2024

Cool, after brushing all this up, the latest release is MUCH cleaner for alternating optimization methods like SCCA_PMD.

Also has a nice tqdm readout if you have verbose=True showing how far through the components it is and how far through the iterations it is.

JohannesWiesner commented on June 18, 2024

Nice, thanks so much! <3
