radlfabs / flexcv

Python package customizing nested cross validation for tabular data.

Home Page: https://radlfabs.github.io/flexcv/

License: MIT License

Python 100.00%
cross-validation grouped-datasets machine-learning mixed-effects mixed-effects-models neptune-ai nested-cross-validation regression regression- regression-models sklearn tabular-data xgboost

flexcv's Introduction

Flexible Cross Validation and Machine Learning for Regression on Tabular Data

Authors: Fabian Rosenthal, Patrick Blättermann and Siegbert Versümer

Introduction

This repository contains the code for the Python package flexcv, which implements flexible cross validation and machine learning for tabular data. Its code is used for the machine learning evaluations in Versümer et al. (2023). The core functionality was developed in the course of a research project at Düsseldorf University of Applied Science, Germany.

flexcv is a method comparison package for Python that wraps around popular libraries so you can easily tailor complex cross validation code to your needs.

It provides a range of features for comparing machine learning models on different datasets with different sets of predictors customizing just about everything around cross validations. It supports both fixed and random effects, as well as random slopes.

Install the package and give it a try:

pip install flexcv

You can find our documentation here.

Features

The flexcv package provides the following features:

  1. Cross-validation of model performance (generalization estimation)
  2. Selection of model hyperparameters using an inner cross-validation and a state-of-the-art optimization provided by optuna.
  3. Customization of objective functions for optimization to select meaningful model parameters.
  4. Fixed and mixed effects modeling (random intercepts and slopes).
  5. Scaling of inner and outer cross-validation folds separately.
  6. Easy usage of the state-of-the-art logging dashboard neptune to track all of your experiments.
  7. Adaptations for cross validation splits with stratification for continuous target variables.
  8. Easy local summary of all evaluation metrics in a single table.
  9. Wrapper classes for the statsmodels package to use their mixed effects models inside of a sklearn Pipeline. Read more about that package here.
  10. Uses the merf package to apply correction for clustered data using the expectation maximization algorithm and supporting any sklearn BaseEstimator. Read more about that package here.
  11. Inner cross validation implementation that lets you push groups to the inner split, e.g. to apply GroupKFold.
  12. Customizable ObjectiveScorer function for hyperparameter tuning that lets you make a trade-off between under- and overfitting.
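
Feature 7 mentions stratified splits for continuous targets. One common way to get such splits, sketched here with plain sklearn and not necessarily how flexcv implements it, is to bin the target into quantiles and stratify on the bin labels. The helper name continuous_stratified_splits and the choice of five bins are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def continuous_stratified_splits(y, n_splits=5, n_bins=5, random_state=42):
    """Yield (train_idx, test_idx) pairs stratified on quantile bins of y."""
    y = np.asarray(y, dtype=float)
    # interior quantile edges, e.g. 20%, 40%, 60%, 80% for five bins
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    # one bin label per sample, in 0 .. n_bins - 1
    bins = np.digitize(y, edges)
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # the features are only used for their length here, so a dummy array is enough
    yield from splitter.split(np.zeros((len(y), 1)), bins)


rng = np.random.default_rng(0)
y = rng.normal(size=200)
folds = list(continuous_stratified_splits(y))
print(f"{len(folds)} folds, test sizes: {[len(test) for _, test in folds]}")
```

Each test fold then draws from every part of the target distribution, which a plain KFold does not guarantee for small or skewed samples.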

These are the core packages used under the hood in flexcv:

  1. sklearn - A very popular machine learning library. We use their Estimator API for models, the pipeline module, the StandardScaler, metrics and of course wrap around their cross validation split methods. Learn more here.
  2. Optuna - A state-of-the-art optimization package. We use it for parameter selection in the inner loop of our nested cross validation. Learn more about theoretical background and opportunities here.
  3. neptune - Awesome logging dashboard with lots of integrations. It is a charm in combination with Optuna. We used it to track all of our experiments. Neptune is quite deeply integrated into flexcv. Learn more about this great library here.
  4. merf - Mixed Effects for Random Forests. Applies correction terms on the predictions of clustered data. Works not only with random forest but with every sklearn BaseEstimator.

Why would you use flexcv?

Working with cross validation in Python usually starts with creating a sklearn pipeline. Pipelines are super useful to combine preprocessing steps with model fitting and prevent data leakage. However, there are limitations, e. g. if you want to push the training part of your clustering variable to the inner cross validation split. For some of the features, you would have to write a lot of boilerplate code to get it working, and you end up with a lot of code duplication. As soon as you want to use a linear mixed effects model, you have to use the statsmodels package, which is not compatible with the sklearn pipeline. flexcv solves these problems and provides a lot of useful features for cross validation and machine learning on tabular data, so you can focus on your data and your models.
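
To make the boilerplate concrete, here is a sketch of a hand-written nested cross validation in plain sklearn that scales inside each fold and pushes the training part of the grouping variable to the inner split. It does not use flexcv at all; the synthetic data, the Ridge model and the alpha grid are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic clustered data: 10 groups with 30 samples each
rng = np.random.default_rng(42)
n_groups, n_per_group = 10, 30
groups = np.repeat(np.arange(n_groups), n_per_group)
X = rng.normal(size=(n_groups * n_per_group, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=len(groups))

outer_scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # the scaler is refit on every training portion, so no test data leaks in
    pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
    search = GridSearchCV(
        pipe,
        param_grid={"model__alpha": [0.01, 0.1, 1.0, 10.0]},
        cv=GroupKFold(n_splits=3),
        scoring="neg_mean_squared_error",
    )
    # pass only the training part of the grouping variable to the inner split
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    y_pred = search.predict(X[test_idx])
    outer_scores.append(mean_squared_error(y[test_idx], y_pred))

print(f"outer MSE per fold: {np.round(outer_scores, 4)}")
```

Every additional model, metric or split strategy multiplies this loop, which is the duplication the paragraph above refers to.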

Getting Started

Let's set up a minimal working example using a LinearRegression estimator and some randomly generated regression data. The CrossValidation class is the core of this package. It holds all the information about the data, the models, the cross validation splits and the results. It is also responsible for performing the cross validation and logging the results. Setting up the CrossValidation object is easy. We can use method chaining to set up our configuration and perform the cross validation. You might be familiar with this pattern from pandas and other packages. The set-methods all return the CrossValidation object itself, so we can chain them together. The perform method then performs the cross validation and returns the CrossValidation object again. The get_results method returns a CrossValidationResults object which holds all the results of the cross validation. It has a summary property which returns a pandas.DataFrame with all the results. We can then use the to_excel method of the DataFrame to save the results to an excel file.

# import the interface class, a data generator and our model
from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
from flexcv.models import LinearModel
  
# generate some random sample data that is clustered
X, y, group, _ = generate_regression(10, 100, n_slopes=1, noise_level=9.1e-2, random_seed=42)

# instantiate our cross validation class
cv = CrossValidation()

# now we can use method chaining to set up our configuration and perform the cross validation
results = (
    cv
    .set_data(X, y, group, dataset_name="ExampleData")
    # configure our split strategies. Let's go for a GroupKFold since our data is clustered
    .set_splits(split_out="GroupKFold")
    # add the model class
    .add_model(LinearModel)
    .perform()
    .get_results()
)

# results has a summary property which returns a dataframe
# we can simply call the pandas method "to_excel"
results.summary.to_excel("my_cv_results.xlsx")

You can then use the various functions and classes provided by the framework to compare machine learning models on your data. Additional information on getting started with this package will be added here soon, as well as to the documentation (https://radlfabs.github.io/flexcv/).

Documentation

Have a look at our documentation. We currently add lots of additional guides and tutorials to help you get started with flexcv. If you are interested in writing a guide or tutorial, feel free to contact us. It would be great to have some community contributions here.

Conclusion

flexcv is a powerful tool for comparing machine learning models on different datasets with different sets of predictors. It provides a range of features for cross-validation, parameter selection, and experiment tracking. With its state-of-the-art optimization package and logging dashboard, it is a valuable addition to any machine learning workflow.

Earth Extension

A wrapper implementation of the Earth regression package for R exists that you can use with flexcv. It is called flexcv-earth. It is not yet available on PyPI, but you can install it from GitHub with the command pip install git+https://github.com/radlfabs/flexcv-earth.git. You can then use the EarthModel class in your flexcv configuration by importing it from flexcv_earth. Further information is available in the documentation.

Acknowledgements

We would like to thank the developers of sklearn, optuna, neptune and merf for their great work. Without their awesome packages and dedication, this project would not have been possible. The logo design was generated by Microsoft Bing Chat Image Creator using the prompt "Generate a logo graphic where a line graph becomes the letters 'c' and 'v'. Be as simple and simplistic as possible."

Contributions

We welcome contributions to this repository. Feel free to open an issue or pull request if you have any suggestions, problems or questions. Since the project is maintained as a side project, we cannot guarantee a quick response or fix. However, we will try to respond as soon as possible. We strongly welcome contributions to the documentation and tests. If you have any questions about contributing, feel free to contact us.

About

flexcv was developed at the Institute of Sound and Vibration Engineering at the University of Applied Science Düsseldorf, Germany and is now published and maintained by Fabian Rosenthal as a personal project.

flexcv's People

Contributors

radlfabs

flexcv's Issues

Testing Suite takes too long

Due to the iterative nature of processes and evaluations in flexcv the testing suite just takes too long, e.g. > 4 hrs.
This issue tracks reducing testing times as much as possible.

Fit_Kwargs to allow custom callbacks

We need to allow custom callbacks to allow people to use e.g. the neptune-xgboost integration.

For example, the XGBRegressor with the sklearn interface has the following keyword argument in .fit():

callbacks: Optional[Sequence[TrainingCallback]] = None,

Which is defined as

callbacks (Optional[List[TrainingCallback]]) –

List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

A callback where a run object is passed (e.g. in the neptune-xgboost integration) would only work without dummy instantiation.
Therefore, we could create a dict in the mappings in our interface as follows

model_callback = CustomCallback(run)
CrossValidation.config["mapping"]["model_name"]["fit_kwargs"] = {"callbacks": [model_callback]} if model_callback is not None else {}

Then in core.cross_validate we could instantiate the fit_kwargs dict by indexing the model mapping

fit_kwargs = mapping[model_name]["fit_kwargs"]

Which then will be unpacked to the fit call.

Make test cases

  • to check that the interface assignment works
  • to check that a mock callback was called.
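
The proposal and the second test case can be sketched together as follows. DummyRegressor, fit_with_kwargs and the layout of the mapping dict are hypothetical stand-ins for this illustration and are not flexcv's actual classes or internals; the point is only that an optional fit_kwargs entry is unpacked into fit() and that a mock callback can verify the call.

```python
from unittest.mock import MagicMock


class DummyRegressor:
    """Hypothetical stand-in for an estimator whose fit() accepts callbacks."""

    def fit(self, X, y, callbacks=None):
        # call every callback once, as a real estimator would do per iteration
        for callback in callbacks or []:
            callback(self)
        return self


def fit_with_kwargs(mapping, model_name, X, y):
    """Instantiate the mapped model and unpack its optional fit_kwargs into fit()."""
    config = mapping[model_name]
    # a missing entry falls back to an empty dict, so fit() gets no extra arguments
    fit_kwargs = config.get("fit_kwargs", {})
    return config["model"]().fit(X, y, **fit_kwargs)


model_callback = MagicMock()
mapping = {
    "DummyRegressor": {
        "model": DummyRegressor,
        "fit_kwargs": {"callbacks": [model_callback]} if model_callback is not None else {},
    }
}

fitted = fit_with_kwargs(mapping, "DummyRegressor", X=[[0.0], [1.0]], y=[0.0, 1.0])
# the mock records the call, which is what the proposed test case would check
model_callback.assert_called_once_with(fitted)
```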

Add make_model_mapping function

When first starting with the CrossValidation object the user needs to instantiate a ModelMappingDict which uses ModelConfigDict.
This might seem unintuitive for users coming from sklearn. Users also have to know that they need to import and use these classes.

Possible solution:

  • implement a function that helps making this mapping with an easy name
  • check if this maker function could also be used in method chaining on CrossValidation
  • CrossValidation could also implement a check if the dict is already an instance of ModelMappingDict, so the user does not have to call it before.
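
A minimal sketch of such a helper, assuming that both classes behave like plain dicts. The two classes below are simplified stand-ins defined only for this example, and the "params" key is illustrative, not part of the real ModelConfigDict.

```python
class ModelConfigDict(dict):
    """Stand-in for the per-model configuration dict (assumed to be dict-like)."""


class ModelMappingDict(dict):
    """Stand-in for the mapping of model names to configurations."""


def make_model_mapping(mapping):
    """Return a ModelMappingDict, converting plain dicts where necessary."""
    if isinstance(mapping, ModelMappingDict):
        # already the right type, so the user never has to convert twice
        return mapping
    return ModelMappingDict(
        {
            name: config if isinstance(config, ModelConfigDict) else ModelConfigDict(config)
            for name, config in mapping.items()
        }
    )


plain = {"LinearModel": {"params": {}}}
converted = make_model_mapping(plain)
print(type(converted).__name__, type(converted["LinearModel"]).__name__)
```

Because the function returns its input unchanged when it already has the right type, the same check could run inside CrossValidation, as the last bullet suggests.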

Bug: RF is non deterministic

Random Forest does not seem to be deterministic in my test case with a simple kfold/kfold nested cross validation and only tuning max_depth in optuna.
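
The issue does not state the cause. One thing worth ruling out is an unset random_state on the estimator, since sklearn's random forest draws new bootstrap samples on every fit unless it is seeded. The check below is a sketch on synthetic data and is independent of flexcv.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)


def fit_predict(random_state):
    """Fit a small forest with the given seed and return its training predictions."""
    model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=random_state)
    return model.fit(X, y).predict(X)


# two fits with the same seed should agree; two unseeded fits usually do not
seeded_equal = np.allclose(fit_predict(42), fit_predict(42))
unseeded_equal = np.allclose(fit_predict(None), fit_predict(None))
print(f"seeded runs identical: {seeded_equal}")
print(f"unseeded runs identical: {unseeded_equal}")
```

If the estimator is already seeded, the seed of the hyperparameter sampler and the shuffling of the splits would be the next candidates to check.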

Rename KFold Classes

Rename the custom KFold classes meaningfully.
It should be clear what they do and what their pros and cons are.
Update the docstrings where needed.

Bug: Improper display of type annotations in reference.md

In Reference.md some type annotations are parsed irregularly. This seems to be module related.
For some modules the types are not recognized at all and are parsed as part of the descriptions, i.e. displayed in the wrong column.
Working in interface.py, not working e.g. in results_handling.py.
Correct:
image
Incorrect:
image

Build Package

The Python package has to be built when everything is ready.

  • Generate binaries/archive with build

  • Test if the archive has all files

yaml NameError

from flexcv import CrossValidation
Traceback (most recent call last):
  File "", line 1, in <module>
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\__init__.py", line 2, in <module>
    from .interface import CrossValidation
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\interface.py", line 28, in <module>
    from .yaml_parser import read_mapping_from_yaml_file, read_mapping_from_yaml_string
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\yaml_parser.py", line 11, in <module>
    loader: yaml.SafeLoader, node: yaml.nodes.ScalarNode
            ^^^^
NameError: name 'yaml' is not defined

Bug: pformat_dict not imported from utilities

from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
X, y, groups, _ = generate_regression(12, 300, 10)
from flexcv.models import LinearModel

cv = CrossValidation()

performed = (
    cv
    .set_data(X, y, groups)
    .set_splits()
    .add_model(LinearModel)
    .perform()
    .get_results()
)

raises "NameError: name 'pformat_dict' is not defined".

Problem is here:

return f"CrossValidationResults {pformat_dict(self)}"

We need to import pformat_dict from flexcv.utilities.

Update docs after changes to interface

In #13 we make some decisive steps by generalizing the interface and reducing model dependencies inside core functions. The docs need to be updated regarding

  • CrossValidation.set_merf
  • CrossValidation.set_lmer
  • ModelConfigDict -> what is necessary to set here? Which individual values will override global values?
  • Update reference for PostProcessors.

Move Earth Wrapper Class & rpy2 dependency

Rpy2 is raising errors from CFFI. This does not harm the computational side at the moment but results in annoying pop ups.

Screenshot 2023-11-13 155020

Related issues in rpy2: rpy2/rpy2#1063
This may be fixed in rpy2/rpy2#1020 but remains unsolved to date.

Therefore I will move my rpy2 part regarding the earth regressor to flexcv-earth.

Update Docs

The installation guide in the docs must address the pip procedure once the package is on PyPI.

Upload package

  1. Upload to Test PyPI
  2. Test the package
  3. Upload to PyPI
  4. Release to GitHub

Create DOI

A citable DOI has to be generated.

  1. Make a GitHub release
  2. Go to the linked Zenodo account and create the DOI
  3. Add the badge link to the README and update it

Allow passing KFold Iterator directly in the CrossValidation class

At the moment the CrossValidation interface takes either strings or proprietary enums that are parsed as split methods in the process.
It would be beneficial to allow passing cross validators directly, so users could easily write and integrate their own class or simply pass an sklearn iterator.
Currently the parsing is done by flexcv.split.make_cross_val_split. This function could check for a callable cross validation iterator and return its split method if it finds one.
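
A sketch of how such a check could look, using duck typing on the split method. The function name resolve_split and the small name-to-class table are hypothetical and do not mirror the real flexcv.split.make_cross_val_split.

```python
from sklearn.model_selection import GroupKFold, KFold

# hypothetical name-to-class table; the real parser may support more methods
_SPLITTERS = {"KFold": KFold, "GroupKFold": GroupKFold}


def resolve_split(method, n_splits=5):
    """Return a split callable from either a method name or a cross validator object."""
    if isinstance(method, str):
        return _SPLITTERS[method](n_splits=n_splits).split
    if callable(getattr(method, "split", None)):
        # duck typing: anything with a split() method is accepted as-is
        return method.split
    raise TypeError(f"Unsupported split method: {method!r}")


X = [[i] for i in range(12)]
y = list(range(12))
groups = [i // 3 for i in range(12)]

# both call styles end up with the same kind of (train, test) index pairs
by_name = list(resolve_split("GroupKFold", n_splits=4)(X, y, groups))
by_object = list(resolve_split(KFold(n_splits=3))(X, y))
print(len(by_name), len(by_object))
```

Accepting any object with a split method keeps user-defined splitters working without requiring them to inherit from an sklearn base class.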

Allow Classification

Currently flexcv is based around regression.
Let's check what has to be done in order to allow classifiers as well.

  • Metrics: test with the battery of Acc/Recall/F1 etc.
  • Any new args/kwargs for the models?
  • Test with other split methods for classes
  • allow categorical transformers

Complete Testsuite

  • unit test core and utility modules
  • rather complex pytest runs and combinations of models
  • comparison tests to sklearn wherever possible

Update Getting Started

As soon as the interface/API is fixed, we need a great Getting Started in both docs and readme.

README code fails

The README has some bugs in the example code:

  1. A non-closing parenthesis here
  2. And a missing import statement of the module
    Alternatively,
    method_outer_split=flexcv.CrossValMethod.GROUP
    could be changed to method_outer_split="GroupKFold" in order to avoid the module level import.

Tests for our Docs

Tests for the docs are missing!

  1. Would be nice to check code blocks.
  2. Check markdown syntax
  3. Does it build correctly?
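
For the first point, a lightweight starting place could be to extract the fenced Python blocks from each markdown file and compile them without executing anything. This is a sketch under the assumption that the docs mark their code blocks with python fences; the function names are made up for this example. It only catches syntax errors such as an unclosed parenthesis, not missing imports.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself live in a fenced block
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(markdown_text):
    """Return the source of every fenced python block in a markdown string."""
    return CODE_BLOCK.findall(markdown_text)


def find_syntax_errors(markdown_text):
    """Compile each block without running it and collect (index, message) pairs."""
    errors = []
    for index, source in enumerate(extract_python_blocks(markdown_text)):
        try:
            compile(source, f"<doc block {index}>", "exec")
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors


sample_doc = "\n".join(
    [
        "# Guide",
        FENCE + "python",
        "results = (cv.perform()",  # unclosed parenthesis on purpose
        FENCE,
        FENCE + "python",
        "x = 1 + 1",
        FENCE,
    ]
)
errors = find_syntax_errors(sample_doc)
print(errors)
```

Running the blocks for real would also catch the missing import, at the cost of a much longer test run.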

YAML Parser for Model Mapping

For reuse of model configurations it would be nice to work with yaml files.
Therefore, a parser is needed. Main features:

  • translate yaml ModelMappingDicts
  • create inner dicts as ModelConfigDicts
  • instantiate the parameter distributions for the models
  • make imports of model classes and post_processor classes

This will make it easier to reuse larger configurations.

Python 3.12

Add compatibility for Python 3.12.
Currently awaiting neptune support (issue)
which is relying on Numba (issue).

Update Tests for cross validate and its classes

After restructuring the cross validation loop we need some more tests for the

  • SingleModelFoldResult class and its make_results method
  • PostProcessing classes
  • for some of the interface functionality
  • for the new defaults in model mapping
  • for the model mapping -> interface interaction
  • for the "add_merf" keyword
  • for the kwarg handling in the interface

Reduce model-loop: minimize LMER/MERF dependencies in cross_validate

Currently the cross validation loop iterates over base estimators.
With a flag we enter a second code block inside the loop where the mixed effects modeling takes place.
Therefore, LMER and MERF need to be treated as a different model type than any other base estimator.
This is also due to the differences in API.
It would be beneficial to reduce this second part of the loop. MERF will always be in a different hierarchy because it is optimizing on top of base estimators. Unfortunately, LMER also cannot be the base model inside MERF because MERF only supports fixed effects fitting internally for the base model.
A possible solution would be to add a flag "fit_merf" to the ModelConfigDict, which triggers the part of the loop after fitting the base estimator to also enter into MERF fitting and evaluation.
