sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in python

Home Page: https://skpro.readthedocs.io/en/latest

License: BSD 3-Clause "New" or "Revised" License

Python 99.74% Makefile 0.16% Shell 0.05% Dockerfile 0.06%
python sklearn data-science framework prediction machine-learning ai probabilistic-models probability-distributions regression

skpro's Introduction

Version 2.4.1 out now! Read the release notes here.

skpro is a library for supervised probabilistic prediction in python. It provides scikit-learn-like, scikit-base compatible interfaces to:

  • tabular supervised regressors for probabilistic prediction - interval, quantile and distribution predictions
  • tabular probabilistic time-to-event and survival prediction - instance-individual survival distributions
  • metrics to evaluate probabilistic predictions, e.g., pinball loss, empirical coverage, CRPS, survival losses
  • reductions to turn scikit-learn regressors into probabilistic skpro regressors, such as bootstrap or conformal
  • building pipelines and composite models, including tuning via probabilistic performance metrics
  • symbolic probability distributions with value domain of pandas.DataFrame-s and pandas-like interface

Documentation

Tutorials | New to skpro? Here's everything you need to know!
Binder Notebooks | Example notebooks to play with in your browser.
User Guides | How to use skpro and its features.
Extension Templates | How to build your own estimator using skpro's API.
API Reference | The detailed reference for skpro's API.
Changelog | Changes and version history.
Roadmap | skpro's software and community development plan.
Related Software | A list of related software.

Where to ask questions

Questions and feedback are extremely welcome! We strongly believe in the value of sharing help publicly, as it allows a wider audience to benefit from it.

skpro is maintained by the sktime community, and we use the same social channels.

Type | Platforms
Bug Reports | GitHub Issue Tracker
Feature Requests & Ideas | GitHub Issue Tracker
Usage Questions | GitHub Discussions · Stack Overflow
General Discussion | GitHub Discussions
Contribution & Development | dev-chat channel · Discord
Community collaboration session | Discord - Fridays 13 UTC, dev/meet-ups channel

Features

Our objective is to enhance the interoperability and usability of the AI model ecosystem:

  • skpro is compatible with scikit-learn and sktime, e.g., an sktime proba forecaster can be built with an skpro proba regressor, which is an sklearn regressor with proba mode added by skpro

  • skpro provides a mini-package management framework for first-party implementations, and for interfacing popular second- and third-party components, such as cyclic-boosting or MAPIE packages.

skpro curates libraries of components of the following types:

Module | Status | Links
Probabilistic tabular regression | maturing | Tutorial · API Reference · Extension Template
Time-to-event (survival) prediction | maturing | API Reference · Extension Template
Performance metrics | maturing | API Reference
Probability distributions | maturing | Tutorial · API Reference · Extension Template

Installing skpro

To install skpro, use pip:

pip install skpro

or, with maximum dependencies,

pip install skpro[all_extras]

Releases are available as source packages and binary wheels. You can see all available wheels here.

Quickstart

Making probabilistic predictions

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from skpro.regression.residual import ResidualDouble

# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_new, y_train, y_test = train_test_split(X, y)

# step 2: specifying the regressor - any compatible regressor is valid!
# example - "squaring residuals" regressor
# random forest for mean prediction
# linear regression for variance prediction
reg_mean = RandomForestRegressor()
reg_resid = LinearRegression()
reg_proba = ResidualDouble(reg_mean, reg_resid)

# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)

# step 4: predicting labels on new data

# probabilistic prediction modes - pick any or multiple

# full distribution prediction
y_pred_proba = reg_proba.predict_proba(X_new)

# interval prediction
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)

# quantile prediction
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])

# variance prediction
y_pred_var = reg_proba.predict_var(X_new)

# mean prediction is same as "classical" sklearn predict, also available
y_pred_mean = reg_proba.predict(X_new)

Evaluating predictions

# step 5: specifying evaluation metric
from skpro.metrics import CRPS

metric = CRPS()  # continuous ranked probability score - any skpro metric works!

# step 6: evaluate metric, compare predictions to actuals
metric(y_test, y_pred_proba)
>>> 32.19
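
The feature list above also names pinball loss and empirical coverage. The following is a minimal numpy sketch of what these two quantities measure; the function names `pinball_loss` and `empirical_coverage` are hypothetical helpers for illustration and are not skpro's metric classes.

```python
import numpy as np


def pinball_loss(y_true, y_quantile, alpha):
    """Mean pinball (quantile) loss of a predicted alpha-quantile."""
    diff = np.asarray(y_true) - np.asarray(y_quantile)
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))


def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


# toy data: the second observation lies outside its interval
y_true = np.array([1.0, 2.0, 3.0, 4.0])
lower = np.array([0.5, 2.5, 2.0, 3.0])
upper = np.array([1.5, 3.5, 4.0, 5.0])

coverage = empirical_coverage(y_true, lower, upper)
loss = pinball_loss(y_true, upper, alpha=0.95)
```

A well-calibrated 90% interval should show an empirical coverage close to 0.9 on held-out data; the toy numbers here only illustrate the computation.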

How to get involved

There are many ways to get involved with development of skpro, which is developed by the sktime community. We follow the all-contributors specification: all kinds of contributions are welcome - not just code.

Contribute | How to contribute to skpro.
Mentoring | New to open source? Apply to our mentoring program!
Meetings | Join our discussions, tutorials, workshops, and sprints!
Developer Guides | How to further develop the skpro code base.
Contributors | A list of all contributors.
Roles | An overview of our core community roles.
Donate | Fund sktime and skpro maintenance and development.
Governance | How and by whom decisions are made in sktime's community.

Citation

To cite skpro in a scientific publication, see citations.

skpro's People

Contributors

alex-jg3, an20805, bhavikar04, dependabot[bot], duydl, fkiraly, frthjf, iwitaly, julian-fong, malikrafsan, meraldoantonio, nilesh05apr, ram0nb, sairevanth25, setoguchi-naoki, shreesham07, sukjingitsit, szepeviktor, yarnabrina

skpro's Issues

[ENH] interface `XGBoostLSS` et al by `StatMixedML`

It would be great to interface the various probabilistic supervised regressors of StatMixedML, so they can then immediately be used for forecasting in sktime via skpro!

FYI @StatMixedML, @joshdunnlime

Many thanks to @KiwiAthlete for the suggestion!

[ENH] MAPIE interface

MAPIE is one of the more popular "favourite algorithm" repositories in the probabilistic supervised learning space.

Their scope is more geared towards multiple tasks (incl classification, time series forecasting, etc), whereas skpro has a more stringent framework layer imo.

The specific estimator that would be great to interface is the MapieRegressor:
https://github.com/scikit-learn-contrib/MAPIE/blob/master/mapie/regression/regression.py

should be straightforward with the extension template.
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py

FYI @gmartinonQM, @thibaultcordier, @LacombeLouis, @vincentblot28, @vtaquet, @SZiane in case there might be interest in close collaboration - I see you are planning to add further models, and you are also following closely sklearn as a design template, so perhaps we should merge base class architectures?

For context, skpro is a 2017 package for proba supervised (tabular) regression which is one of the design antecessors of sktime. Parts of it became the sktime probabilistic forecasting module, now it has been rearchitected using skbase. Of course a number of proba supervised regression packages have seen the light of day since 2017.

sktime and skpro adopt a framework integration philosophy where third party packages can be easily interfaced and used as components in pipelines etc, we manage dependencies on the level of estimators.
https://www.sktime.net/en/stable/

For instance, skpro probabilistic tabular regressors can be used in sktime probabilistic forecasting pipelines.

[ENH] multiple quantile regression

Is your feature request related to a problem? Please describe.

For quantile regression, often more than one quantile probability is of interest. However, existing sklearn-compatible quantile regressors always fit and predict a single quantile probability. To the best of my knowledge, there is no standardized way to integrate multiple quantile regression with sktime/skpro probabilistic prediction methods such as predict_quantiles/predict_interval.

Describe the solution you'd like

A new skpro regressor that wraps multiple quantile regressors and supports probabilistic predictions from the wrapped regressors.

Describe alternatives you've considered

None, as the proposed solution is already discussed in Sktime issue sktime/sktime#5357
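
A minimal sketch of the idea, assuming one scikit-learn quantile regressor is fitted per quantile probability. The class name `MultiQuantileSketch` and its `predict_quantiles` return layout are hypothetical and do not follow the skpro extension contract; a real implementation would start from the regression extension template.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


class MultiQuantileSketch:
    """Hypothetical wrapper: fits one quantile regressor per alpha."""

    def __init__(self, alphas, n_estimators=50, random_state=0):
        self.alphas = list(alphas)
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        self.models_ = {}
        for a in self.alphas:
            model = GradientBoostingRegressor(
                loss="quantile",
                alpha=a,
                n_estimators=self.n_estimators,
                random_state=self.random_state,
            )
            self.models_[a] = model.fit(X, y)
        return self

    def predict_quantiles(self, X):
        # one column per quantile probability, in the order of self.alphas
        return pd.DataFrame({a: m.predict(X) for a, m in self.models_.items()})


rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=200)

reg = MultiQuantileSketch(alphas=[0.1, 0.5, 0.9]).fit(X, y)
q_pred = reg.predict_quantiles(X)
```

Because the models are fitted independently, nothing in this sketch prevents predicted quantiles from crossing.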

[ENH] explicit/analytic form of energy function for chi-squared distribution

There does not seem to exist a literature reference for the energy functionals of the chi-squared distribution, we should try to derive it, or find a reference.

Collecting discussion below, from #22.

Current state:

  • explicit formula for the cross-term $\mathbb{E}[|X-y|]$ derived, but not posted yet
  • no progress yet on the self-term $\mathbb{E}[|X-X'|]$

@sukjingitsit has been working on this.

Would you like to post your partial progress?
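
Until an analytic form is available, both terms can be approximated by Monte Carlo, which also gives a reference value for checking any derived formula. This is a sketch using numpy only; the function name `mc_energy_chi2` is a hypothetical helper, not part of skpro.

```python
import numpy as np


def mc_energy_chi2(dof, y, n_samples=200_000, seed=0):
    """Monte Carlo estimates of E|X - y| and E|X - X'| for X, X' ~ chi2(dof)."""
    rng = np.random.default_rng(seed)
    x = rng.chisquare(dof, size=n_samples)
    x_prime = rng.chisquare(dof, size=n_samples)  # independent copy of X
    cross = float(np.mean(np.abs(x - y)))
    self_term = float(np.mean(np.abs(x - x_prime)))
    return cross, self_term


# sanity check: chi2 with 2 degrees of freedom is exponential with mean 2,
# for which E|X - X'| = 2 and E|X - y| = y - 2 + 4 * exp(-y / 2)
cross, self_term = mc_energy_chi2(dof=2, y=1.0)
```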

Distributions as return objects

Re-opening the sub-issue opened in #3 and commented upon by @murphyk

Question: should skpro's predict methods return a vector of distribution objects?
For example, using the distributions from scipy.stats which implement methods pdf, cdf, mean, var, etc.

Pro:

  • this would be using an existing, consolidated, and well-supported interface
  • it might be easier to use
  • it might be easier to understand

Contra:

  • mixture types are not supported
  • l2 norm is not supported (as would be needed for squared/Gneiting loss)
  • mixed distributions on the reals, especially empirical distributions (weighted sum of deltas) which are returned by Bayesian packages are not supported
  • vectors of distributions are not supported, alternatively Cartesian products of distributions
  • this is not the status quo
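
For reference, a sketch of what the scipy.stats option could look like. A frozen scipy distribution accepts array-valued parameters, so a single object can stand in for several marginal predictions via broadcasting. The sketch does not address the mixture, empirical, or subsetting concerns listed above.

```python
import numpy as np
from scipy.stats import norm

# three predictive normals, one per test instance
means = np.array([0.0, 1.0, 2.0])
sds = np.array([1.0, 0.5, 2.0])
y_pred = norm(loc=means, scale=sds)

point = y_pred.mean()          # array of the three means
variance = y_pred.var()        # array of the three variances
at_mean = y_pred.cdf(means)    # cdf of each normal at its own mean
upper = y_pred.ppf(0.975)      # upper 97.5% quantile per instance
```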

[ENH] Calibration plots for probabilistic predictions

@benHeid made nice calibration plots here: sktime/sktime#5632

So I wonder:

  • should these also be added in skpro for probabilistic tabular regressors? Same principle works.
  • any reasons why it could not be shifted via copy/paste and import change?

Mid-term, it might be best to have them only in skpro, and make sktime forecasting rely more on skpro distributions machinery and utilities.

FYI @benHeid, any thoughts?

[ENH] further improvements on `BaseDistribution` and tests

From implementing a few distributions, some insights on potential improvements:

  • the broadcasting logic could/should be abstracted to apply to a certain subset of parameters rather than be copy-pasted in every class
  • there is repeated boilerplate, this could be resolved via split into public/private methods such as pdf/_pdf, similar to sktime's fit/_fit
  • we should add logical consistency tests, e.g., cdf being inverse to ppf, or pdf and log_pdf being compatible, or more sophisticated tests such as subtracting the mean shifting functions (or not affecting them) accordingly
  • worth thinking about: logical consistency tests of MC approximations against exact implementation, although that can eat a lot of runtime (so perhaps not?)
  • some distributions, such as empirical or composites, need careful thinking about the subsetting logic - so an extension contract for iloc and loc indexing needs to be thought out so it can affect parameters
  • row/column subsetting via loc, iloc, should be tested
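
The logical consistency tests mentioned above could look roughly like the following. The sketch uses a scipy.stats normal as a stand-in, since the method names of a skpro distribution may differ; `check_consistency` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import norm


def check_consistency(dist, shifted, shift, p):
    """Return a dict of boolean consistency checks for a distribution.

    dist: frozen distribution; shifted: the same distribution with
    `shift` subtracted from its location; p: probabilities in (0, 1).
    """
    x = dist.ppf(p)
    return {
        # cdf should invert ppf
        "cdf_inverts_ppf": bool(np.allclose(dist.cdf(x), p)),
        # pdf and log-pdf should agree
        "pdf_matches_logpdf": bool(np.allclose(np.log(dist.pdf(x)), dist.logpdf(x))),
        # shifting the location should shift the cdf argument
        "shift_moves_cdf": bool(np.allclose(shifted.cdf(x - shift), dist.cdf(x))),
    }


p = np.linspace(0.01, 0.99, 25)
results = check_consistency(
    dist=norm(loc=1.0, scale=2.0),
    shifted=norm(loc=0.0, scale=2.0),
    shift=1.0,
    p=p,
)
```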

[ENH] interface to cyclic boosting package

It would be nice to interface cyclic_boosting, which provides implementations of the cyclic boosting regressors.
Blue-Yonder-OSS/cyclic-boosting#56

They are popular in proba forecasting, but by nature are actually proba supervised regression algorithms so "belong" in skpro, from where it can be interfaced by sktime for forecasting.

The interface to implement is in this extension template:
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py

FYI @rijkvandermeulen, @FelixWick

[ENH] support for row multi-index in distributions

Distributions should be able to support a row multi-index, i.e., pd.MultiIndex - as in sktime hierarchical and panel data.

This is probably automatic for most distributions, but not immediate for, e.g., Empirical, where parameterization is in terms of additional index levels.

In addition, testing should be carried out with multi-index examples in the general suite.

Related: PR in sktime which adds multiindex support to Empirical
sktime/sktime#6066

[ENH] adding truncated means as an interface point?

API design and math question: would it make sense to add truncated means as general interface points? Or, more generally, truncated moments?

API design-wise, having it would help with:

  • computation of energy, one of the terms is explicit in cdf and truncated means, see here: #22 (comment)
  • truncation compositor, which then would have access to more analytical formulae

questions:

  • if we add it, how? I would add additional arguments to the existing mean method.

Math-wise, the questions are:

  • how easy are truncated mean and moments to obtain - matching against current literature coverage, and "potential" coverage in terms of analytic forms
  • what are the analytic relations between these, and other distribution defining functions or methods currently present?
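
To illustrate the first point: the cross-term of the energy can be written as $\mathbb{E}|X-c| = c\,(2F(c)-1) + \mathbb{E}[X] - 2\,\mathbb{E}[X \mathbb{1}\{X \le c\}]$, where the last expectation equals the truncated mean $\mathbb{E}[X \mid X \le c]$ times $F(c)$. The sketch below checks this identity numerically for a normal distribution, for which the partial expectation is known in closed form. The function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def cross_term_normal(mu, sigma, c):
    """E|X - c| for X ~ N(mu, sigma^2) via cdf and partial expectation."""
    a = (c - mu) / sigma
    cdf_c = norm.cdf(a)
    # E[X 1{X <= c}] = F(c) * E[X | X <= c] for the normal distribution
    partial_mean = mu * cdf_c - sigma * norm.pdf(a)
    return c * (2 * cdf_c - 1) + mu - 2 * partial_mean


def cross_term_numeric(mu, sigma, c):
    """E|X - c| by numerical integration, split at the kink x = c."""
    def integrand(x):
        return abs(x - c) * norm.pdf(x, loc=mu, scale=sigma)

    left, _ = quad(integrand, -np.inf, c)
    right, _ = quad(integrand, c, np.inf)
    return left + right


closed = cross_term_normal(mu=1.0, sigma=2.0, c=0.5)
numeric = cross_term_numeric(mu=1.0, sigma=2.0, c=0.5)
```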

umbrella issue - `scikit-base` based rearchitecture for `skpro` v2

  • 1st part: rearchitecting on scikit-base #9 -
    this unlocks creating new estimators, distributions, and moving legacy estimators to scikit-base compatible interface
  • 2nd part to cover probabilistic metrics and example notebooks
    this unlocks more metrics and moving legacy metrics #11
  • 3rd part - predict_interval, predict_quantiles etc, as in sktime #18

Moving individual estimators from legacy to scikit-base interface

  • complete missing features in residual two-step
  • bagging estimator
  • density/distribution estimator naive baseline (should include kernel and empirical)
  • composite parametric
  • bounding wrapper
  • grid and random tuning (requires metrics)

Moving distributions from legacy to scikit-base interface

  • Laplace
  • Cauchy
  • t
  • empirical
  • mixture (compositor)

Moving metrics from legacy to scikit-base interface - can be reused from sktime

  • CRPS
  • log-loss
  • capped log-loss
  • squared loss

Related: old refactor effort (superseded)
#6

[ENH] explicit/analytic form of energy function for log-normal distribution

There does not seem to exist a literature reference for the energy functionals of the log-normal distribution, we should try to derive it, or find a reference.

Collecting discussion below, from #214.

Current state:

  • explicit formula for the cross-term $\mathbb{E}[|X-c|]$ (almost) derived
  • no progress yet on the self-term $\mathbb{E}[|X-X'|]$

@bhavikar04 used Wolfram Alpha to derive the following indefinite integral related to the cross-term $\mathbb{E}[|X-c|]$:
[screenshot: Screenshot 2024-03-23 130300]

My reply:
this looks correct. Now you need to add the limits. That should be an easy substitution, no? I recommend, do that manually. Use that

$\lim_{x\rightarrow -\infty} \mbox{erf}(x) = -1$, and $\lim_{x\rightarrow \infty} \mbox{erf}(x) = 1$. You need to be careful with the sign, but that should be it?

The number 0.707 etc should be $\frac{1}{2} \sqrt{2}$, but it doesn't matter for the limits.
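
As a cross-check for the limits, here is a candidate closed form for the cross-term, obtained from the standard partial expectation of the log-normal rather than from the screenshot above. With $a = (\ln c - \mu)/\sigma$ and $m = e^{\mu + \sigma^2/2}$, it reads $\mathbb{E}|X-c| = c\,(2\Phi(a) - 1) + m\,(1 - 2\Phi(a - \sigma))$ for $c > 0$. The sketch compares it against numerical integration; the function names are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import lognorm, norm


def cross_term_lognormal(mu, sigma, c):
    """Candidate closed form of E|X - c| for log X ~ N(mu, sigma^2), c > 0."""
    a = (np.log(c) - mu) / sigma
    m = np.exp(mu + 0.5 * sigma**2)  # mean of the log-normal
    return c * (2 * norm.cdf(a) - 1) + m * (1 - 2 * norm.cdf(a - sigma))


def cross_term_lognormal_numeric(mu, sigma, c):
    """E|X - c| by numerical integration, split at the kink x = c."""
    def integrand(x):
        return abs(x - c) * lognorm.pdf(x, s=sigma, scale=np.exp(mu))

    left, _ = quad(integrand, 0.0, c)
    right, _ = quad(integrand, c, np.inf)
    return left + right


closed = cross_term_lognormal(mu=0.3, sigma=0.5, c=1.2)
numeric = cross_term_lognormal_numeric(mu=0.3, sigma=0.5, c=1.2)
```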

First example: 'utils' not found

The first example in your documentation (DensityBaseline) does not run right on my machine: it throws a 'module not found' exception at the call to 'utils'.

This might be a python version problem (I am using 3.6), so perhaps it's not an error in the normal sense - though I don't see any specification that the package requires a particular python version. Apologies if I missed it. In any case, I fixed it by using matplotlib instead, i.e.

import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)

instead of:

import utils
utils.plot_performance(y_test, y_pred)

[ENH] probabilistic `TransformedTargetRegressor`

The package should contain a TransformedTargetRegressor that is applicable to probabilistic regressors.

Most likely this would make use of #30.

For some distribution/transform combinations, there might be better dispatchable transformed distributions.

[BUG] incorrectly low coverage reports

Coverage reports seem to give low percentages - even in cases where it should be clear that the estimator is tested.

Perhaps the coverage does not recognize the lines due to the incremental and matrix testing?

[ENH] quantile crossing handling in multiple quantile regression

We should think how to handle quantile crossing in multiple quantile regression.

Original post of @Ram0nB:

Good one. We could provide the user a "method" parameter similar to Sktime's Imputer, provide the user a quantile_crossing_callable similar to Sktime's FunctionTransformer, or both. What are your thoughts on this @fkiraly ? Maybe we can open an enhancement issue for this for now?

Originally posted by @Ram0nB in #107 (comment)
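
A sketch of what a "method" parameter could switch between, assuming quantile predictions are held in a DataFrame with one column per quantile probability. The function `fix_crossing` and its method names are hypothetical and not an existing skpro or sktime API.

```python
import numpy as np
import pandas as pd


def fix_crossing(quantiles, method="sort"):
    """Make each row of quantile predictions non-decreasing in alpha."""
    cols = sorted(quantiles.columns)
    values = quantiles[cols].to_numpy(dtype=float)
    if method == "sort":
        # rearrangement: sort predicted values within each row
        fixed = np.sort(values, axis=1)
    elif method == "cummax":
        # running maximum: a higher quantile never drops below a lower one
        fixed = np.maximum.accumulate(values, axis=1)
    else:
        raise ValueError(f"unknown method: {method}")
    return pd.DataFrame(fixed, index=quantiles.index, columns=cols)


# both rows contain a crossing
q = pd.DataFrame({0.1: [1.0, 2.0], 0.5: [0.5, 2.5], 0.9: [3.0, 2.2]})
q_sorted = fix_crossing(q, method="sort")
q_cummax = fix_crossing(q, method="cummax")
```

Sorting keeps the set of predicted values per row, while the running maximum keeps the lowest quantile untouched; which behaviour is preferable is exactly the design question of this issue.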

[ENH] refactor - move `utils.git_diff` to `skbase`

Currently, utils.git_diff is a copy of the one in sktime, which is not DRY.

Both instances of the git_diff module should move to skbase, and be imported from there.

Possibly also including the fixture generator for the test classes.

After removal in the original packages, the skbase lower bound should be raised to guarantee presence of the import.

[DOC] extension template for probability distributions

We should write an extension template for probability distributions - possibly we may like to do this after some of the improvements in #21
which will change the extender contract slightly.

This should look similar to the sktime extension templates:
https://github.com/sktime/sktime/tree/main/extension_templates

From @Alex-JG3 in #21, it is important to highlight the interplay between the tags (what is exact/approximate?) and the option to not implement certain methods: There are four cases:

  • method must be implemented, or one out of k methods must be implemented
  • method need not be implemented as there is a default, but can be, to replace a less accurate or less efficient default

combined with:

  • method results in numerically exact return
  • method return is approximate

[ENH] interface probabilistic regressors from `ngboost` package

Update by @fkiraly - we should interface ngboost regressors as probabilistic supervised (tabular) regressors.
As discussed in the below, this should be an skpro regressor, which in turn can be used in sktime reduction forecasters (such as YfromX). The original request was for using ngboost as a forecaster in sktime, but this is a tabular proba regressor that needs to go through an additional reduction step (which is now implemented in sktime).

It should be straightforward to interface ngboost using the probabilistic regressor extension template:
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py
so adding it as a good first issue.

The main technical concern might be translating the ngboost probability distributions into skpro probability distributions, but that should also be addressable with a lean adaptation layer (personally, I would add that adaptation in an adapter utility subpackage in regression.ngboost).

Original request below.


Is your feature request related to a problem? Please describe.
Can we build a probabilistic forecaster using ngboost? A probabilistic regression method like ngboost will give us the confidence intervals out of the box, unlike others where we need a heuristic (like quantile loss) to get the values of confidence/prediction intervals.

Describe the solution you'd like
A rough sketch:

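import scipy.stats
from ngboost import NGBRegressor


class ProbForecaster:
    def __init__(self):
        self.estimator_ = NGBRegressor()

    def fit(self, y, X=None):
        # make_reduction is a placeholder for tabulating the series y
        y = make_reduction(y)
        self.estimator_.fit(y)

    def predict(self, fh, X=None, alpha=0.95, conf_int=True):
        # y is a placeholder for the reduced inputs at the horizon fh
        mean_preds = self.estimator_.predict(y)
        distribution = self.estimator_.pred_dist(y)
        conf_ints = scipy.stats.norm.interval(
            alpha, distribution.loc, distribution.scale
        )
        return mean_preds, conf_ints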

Describe alternatives you've considered
Using Xgboost/LightGBM with quantile loss makes the calculation of prediction intervals inefficient.

Additional context
Ngboost Doc- https://stanfordmlgroup.github.io/projects/ngboost/

[BUG] QPD distribution and `CyclicBoosting` API non-compliance

@setoguchi-naoki, @FelixWick, I have to apologize in advance for this.

Due to an error in the test logic introduced in an update to class retrieval (see #189), probability distributions went uncovered for most of your PR's duration.

This is my fault for not noticing, and breaking it in the first place.

Bad news is that the QPD distribution has a few non-compliances:

  • cdf method
  • mean, var
  • subsetting

CyclicBoosting as well:

  • _predict_quantiles arg should be alpha, not quantiles

Good news is that this has not been released yet, and I was running more tests before the 2.2.0 release, which is what ultimately caught the problem.

I'll simply wait with 2.2.0 until we have had time to fix this - happy to help.

For testing locally, you need to depend on the branch #189 until it is merged.

[ENH] roadmap of probabilistic regressors to implement or to interface

A wishlist for probabilistic regression methods to implement or interface.
This is partly copied from the list I made when designing the R counterpart mlr-org/mlr3proba#32 .
Number of stars at the end is estimated difficulty or time investment.

GLM

  • generalized linear model(s) with continuous regression link, e.g., Gaussian *
    • Gaussian link, statsmodels
    • further regression links: Gamma, Tweedie, inverse Gaussian
  • generalized linear model(s) with count link, e.g., Poisson *
    • Poisson link, statsmodels
    • Poisson link, sklearn
    • further links: Binomial
  • heteroscedastic linear regression ***
  • Bayesian GLM where conjugate priors are available, e.g., GLM with Gaussian link ***

KRR aka Gaussian process regression

  • vanilla kernel ridge regression with fixed kernel parameters and variance *
  • kernel ridge regression with MLE for kernel parameters and regularization parameter **
  • heteroscedastic KRR or Gaussian processes ***

CDE

  • variants of conditional density estimation (Nadaraya-Watson type) **
  • reduction to density estimation by binning of input variables, then apply unconditional density estimation **

Gradient boosting and tree-based

  • ngboost package interface *
  • probabilistic residual boosting **
  • probabilistic regression trees **

Neural networks

  • interface tensorflow probability - some hard-coded NN architectures **
  • generic tensorflow probability interface - some hard-coded NN architectures ***

Bayesian toolboxes

  • generic pymc3 interface ***
  • generic pyro interface ****
  • generic Stan interface ****
  • generic JAGS interface ****
  • generic BUGS interface ****
  • generic Bayesian interface - prior-valued hyperparameters *****

Pipeline elements for target transformation

  • distr fixed target transformation **
  • distr predictive target calibration **

Composite techniques, reduction to deterministic regression

  • stick mean, sd, from a deterministic regressor which already has these as return types into some location/scale distr family (Gaussian, Laplace) *
  • use model 1 for the mean, model 2 fit to residuals (squared, absolute, or log), put this in some location/scale distr family (Gaussian, Laplace) **
  • upper/lower thresholder for a regression prediction, to use as a pipeline element for a forced lower variance bound **
  • generic parameter prediction by elicitation, output being plugged into parameters of a distr object not necessarily scale/location ****
  • reduction via bootstrapped sampling of a deterministic regressor **

Ensembling type pipeline elements and compositors

  • simple bagging, averaging of pdf/cdf **
  • probabilistic boosting ***
  • probabilistic stacking ***

baselines

  • always predict a Gaussian with mean = training mean, var = training var *
  • unconditional densities via distfit package, interface *
  • IMPORTANT as featureless baseline: reduction to distr/density estimation to produce an unconditional probabilistic regressor **
  • IMPORTANT as deterministic style baseline: reduction to deterministic regression, mean = prediction by det.regressor, var = training sample var, distr type = Gaussian (or Laplace) **

Other reduction from/to probabilistic regression

  • reducing deterministic regression to probabilistic regression - take mean, median or mode **
  • reduction(s) to quantile regression, use predictive quantiles to make a distr ***
  • reducing deterministic (quantile) regression to probabilistic regression - take quantile(s) **
  • reducing interval regression to probabilistic regression - take mean/sd, or take quantile(s) **
  • reduction to survival, as the sub-case of no censoring **
  • reduction to classification, by binning ***

[ENH] update `evaluate` and tuning to survival models

The evaluation utilities and tuners for probabilistic regression work with survival prediction models introduced in #157, but do not pass C through.

They should be updated - before the release containing #157 - to pass C through to models fitted on resamples.

[ENH] Add Lambert W x F distributions

Is your feature request related to a problem? Please describe.

For modeling skewed and/or heavy-tailed distributions, I'd like to have support for Lambert W x F distributions. On top of modeling, Lambert W x F distributions make it possible to "Gaussianize" the observed data.

This is especially useful / prevalent for financial time series data, which is often skewed and/or heavy-tailed.

Describe the solution you'd like

This exists in the LambertW R package and the pylambertw Python module, which is an sklearn transformer/estimator wrapper around torchlambertw.

Describe alternatives you've considered

Other heavy-tailed distributions; but none of the typical ones allow the ease of interpretation of the heavy-tail parameter, the input/output system view of transformation, and a bijective back-transformation.

Additional context

I'd be happy to open a PR to implement a first version of Lambert W x Gaussian distributions, but would like some guidance/pointers on best practices for skpro.
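
For orientation, a sketch of the bijective transformation behind the heavy-tail variant, assuming the usual definition $z = u \exp(\delta u^2 / 2)$ with tail parameter $\delta \ge 0$. The inverse uses the principal branch of the Lambert W function from scipy. The function names are hypothetical and not taken from pylambertw or torchlambertw.

```python
import numpy as np
from scipy.special import lambertw


def heavy_tail_forward(u, delta):
    """Map a standardized input u to a heavier-tailed output z."""
    u = np.asarray(u, dtype=float)
    return u * np.exp(0.5 * delta * u**2)


def heavy_tail_inverse(z, delta):
    """Back-transform ("Gaussianize") z to the latent input u."""
    z = np.asarray(z, dtype=float)
    if delta == 0:
        return z
    # principal branch; real-valued since delta * z**2 >= 0
    w = lambertw(delta * z**2).real
    return np.sign(z) * np.sqrt(w / delta)


rng = np.random.default_rng(0)
u = rng.normal(size=1000)
z = heavy_tail_forward(u, delta=0.2)
u_back = heavy_tail_inverse(z, delta=0.2)
```

A distribution class would additionally need the density of z, which follows from the change-of-variables formula applied to this map.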

[ENH] refactor - move `_check_soft_dependencies` to `skbase`

Currently, _check_soft_dependencies is a copy of the one in sktime, which is not DRY.

Both instances of _check_soft_dependencies should move to skbase, and be imported from there.

After removal in the original packages, the skbase lower bound should be raised to guarantee presence of the import.

documentation: np.mean(y_pred) does not work

I'm following along with this intro example. However, this line fails:

(numpy.mean(y_pred) * 2).shape

Error below (seems to be because Distribution objects don't support the mean() function but instead insist on obscurely calling it point!)

np.mean(y_pred)
Traceback (most recent call last):

  File "<ipython-input-38-19819be87ab5>", line 1, in <module>
    np.mean(y_pred)

  File "/home/kpmurphy/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2920, in mean
    out=out, **kwargs)

  File "/home/kpmurphy/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 75, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)

TypeError: unsupported operand type(s) for +: 'Distribution' and 'Distribution'

problem in loading the skpro

I have been trying to import skpro for 2 days, but I cannot. I keep getting this error:

cannot import name 'six' from 'sklearn.externals' (C:\Users\My Book\anaconda3\lib\site-packages\sklearn\externals\__init__.py)

skpro-refactoring (version-2)

See below some comments/description of the coming refactoring contents :

  • Distribution classes refactoring in a more OOD way (see. skpro->distribution)
  • Loss functions (see. metrics->distribution)
  • Estimators (see. metrics->distribution)

Some descriptive notebooks (in docs->notebooks) and a full set of unit test (in tests) are also available.

[BUG] The `cdf` for the `Empirical` distribution returns a dataframe of objects

Describe the bug

The cdf function for the Empirical distribution returns a dataframe of python objects. It should return a dataframe of floats or integers.

To Reproduce

>>> import pandas as pd
>>> import numpy as np
>>> from skpro.distributions.empirical import Empirical
>>> spl_idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["sample", "time"])
>>> spl = pd.DataFrame([[0, 1], [2, 3], [10, 11], [6, 7], [8, 9], [4, 5]], index=spl_idx, columns=["a", "b"])
>>> d = Empirical(spl)
>>> x = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=pd.Index([0, 1, 2]), columns=pd.Index(['a', 'b']))
>>> d.cdf(x)
     a    b
0  0.5  0.5
1  0.5  0.5
2  0.5  0.5
>>> d.cdf(x).values.dtype
dtype('O')
>>> d.cdf(x).dtypes
a    Float64
b    Float64

Expected behavior

The dataframe returned from cdf should contain floats or integers. I don't think Float64 is a valid pandas datatype - it should be float64. Here is an example creating a simple float dataframe.

>>> y = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
>>> y.dtypes
0    float64
1    float64
dtype: object
>>> y.values.dtype
dtype('float64')
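
For context, `Float64` with a capital F is pandas' nullable floating extension dtype, which is distinct from the numpy-backed `float64`. Depending on the pandas version, converting such a frame to a numpy array can yield object dtype, which matches the output above. A sketch of the difference and of an explicit conversion, using pandas and numpy only:

```python
import numpy as np
import pandas as pd

# frame with the nullable extension dtype, as in the cdf return above
nullable = pd.DataFrame({"a": [0.5, 0.5], "b": [0.5, 0.5]}).astype("Float64")
nullable_dtype = str(nullable.dtypes["a"])

# explicit conversion to numpy-backed floats
plain = nullable.astype("float64")
plain_values_dtype = plain.values.dtype

# alternatively, request the target dtype when exporting to numpy
exported = nullable.to_numpy(dtype="float64")
```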

Environment

Pip freeze.

anyio==3.7.1
backoff==2.2.1
certifi==2023.7.22
cfgv==3.4.0
coverage==7.3.0
distlib==0.3.7
exceptiongroup==1.1.3
execnet==2.0.2
filelock==3.12.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
identify==2.5.27
idna==3.4
iniconfig==2.0.0
joblib==1.3.2
nodeenv==1.8.0
numpy==1.24.4
packaging==23.1
pandas==2.0.3
platformdirs==3.10.0
pluggy==1.3.0
pre-commit==3.3.3
pytest==7.4.0
pytest-cov==4.1.0
pytest-randomly==3.15.0
pytest-timeout==2.1.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
scikit-base==0.5.1
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
skpro @ file:///home/alex/documents/skpro
sniffio==1.3.0
threadpoolctl==3.2.0
tomli==2.0.1
tzdata==2023.3
virtualenv==20.24.3

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Additional context

This problem arose when I was trying to write a test to check if cdf is the inverse of ppf.

[ENH] grid and random tuning

The package should include a grid and random tuning utility that can make use of predict_proba and probabilistic metrics already refactored.

This should hopefully be a minor adaptation exercise using the abstractions in sktime.
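
To make the request concrete, here is a minimal sketch of exhaustive grid tuning scored by a probabilistic metric. It uses only numpy; `fit_predict`, `gaussian_nll`, and the parameter names are hypothetical stand-ins and not skpro or sktime API.

```python
import itertools

import numpy as np


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under Normal(mu, sigma)."""
    return float(
        np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2))
    )


def fit_predict(y_train, n_test, shrink, sigma_scale):
    """Toy distributional model: shrunken training mean, scaled training std."""
    mu = shrink * y_train.mean()
    sigma = sigma_scale * y_train.std()
    return np.full(n_test, mu), np.full(n_test, sigma)


rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=400)
y_train, y_val = y[:300], y[300:]

# Every combination in the grid is fitted and scored on the validation split.
param_grid = {"shrink": [0.5, 1.0], "sigma_scale": [0.5, 1.0, 2.0]}
results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    mu, sigma = fit_predict(y_train, len(y_val), **params)
    results.append((gaussian_nll(y_val, mu, sigma), params))

# Lower is better for a loss; the key avoids comparing dicts on ties.
best_score, best_params = min(results, key=lambda r: r[0])
print(best_params, best_score)
```

Random tuning would follow the same loop, drawing a fixed number of parameter combinations instead of enumerating all of them.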

[ENH] ensure sortedness consistency in indices and test

We need to ensure consistency in various indices occurring:

  • return of predict, predict_proba indices
  • output of distribution methods' indices
  • input to distribution constructors, especially when nested

with respect to being sorted.

The open question is the desired end state - I think indices should not be sorted automatically, and should remain consistent, especially when unsorted.

This is due to sorting being a non-trivial operation that can bump prediction cost from O(n_test) to O(n_test log(n_test)) scaling (or worse in case of nestings).
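
A consistency test along these lines could look like the following sketch. `predict_like` is a hypothetical stand-in for `predict` or a distribution method, and `assert_index_preserved` is an illustrative helper, not an existing skpro utility.

```python
import pandas as pd


def assert_index_preserved(X, y_pred):
    """Check that a prediction-like output carries X's index in the same order."""
    assert y_pred.index.equals(X.index), "output index differs from input index"


def predict_like(X):
    """Hypothetical stand-in for predict: row-wise, no sorting."""
    return pd.DataFrame({"y": 2 * X["feat"]}, index=X.index)


# Deliberately unsorted test index.
X_test = pd.DataFrame({"feat": [1.0, 2.0, 3.0]}, index=[2, 0, 1])

y_pred = predict_like(X_test)
assert_index_preserved(X_test, y_pred)

# An implementation that sorts its output would be caught by the same check.
sorted_pred = y_pred.sort_index()
print(sorted_pred.index.equals(X_test.index))  # False
```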

[ENH] roadmap of probability distributions to implement

It would be great to have a basic set of probability distributions implemented.

Umbrella issue for implementing sktime probability distributions.

Recipe: use the extension_templates/distribution.py extension template.
Examples:

  • Normal, for de-novo implementations or manual interfaces
  • Fisk, for interfacing scipy distributions - this is much easier than using the full template

High priority:

  • laplace - #19
  • empirical incl delta - #25
  • multivariate normal

mid priority:

  • t-distribution - #49
  • Cauchy (special case of t) - #49
  • mixture composition - #26
  • truncation, compositor

low priority:

lower priority:

  • alpha #356
  • binomial
  • burr III
  • burr XII
  • erlang
  • f
  • fatigue-life
  • generalized Pareto
  • gamma #355
  • geometric
  • half-cauchy
  • half-normal
  • half-logistic
  • levy
  • log-gamma
  • log-laplace
  • negative binomial
  • pareto
  • skellam
  • truncated normal
  • truncated pareto

list of many more (lowest priority)
https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions - can be interfaced via _ScipyDist adapter easily!
https://en.wikipedia.org/wiki/File:ProbOnto2.5.jpg

Mirrors sktime/sktime#4518
(for high and mid priority)

Contributions can be made to either repository, and should be copied over to the other once approved/merged, until the modules are merged into one.

[BUG] readthedocs failures

Recently, readthedocs builds have started failing with a json problem.

There was indeed a comma (,) missing in switcher.json, but that did not affect the 2.0.1 release, so I'm not sure what is going on here.

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/events.py", line 97, in emit
    results.append(listener.handler(self.app, *args))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/pydata_sphinx_theme/__init__.py", line 99, in update_config
    switcher_content = json.loads(content)
                       ^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 12 column 3 (char 226)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/cmd/build.py", line 293, in build_main
    app = Sphinx(args.sourcedir, args.confdir, args.outputdir,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/application.py", line 272, in __init__
    self._init_builder()
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/application.py", line 343, in _init_builder
    self.events.emit('builder-inited')
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/events.py", line 108, in emit
    raise ExtensionError(__("Handler %r for event %r threw an exception") %
sphinx.errors.ExtensionError: Handler <function update_config at 0x7f511f19b600> for event 'builder-inited' threw an exception (exception: Expecting ',' delimiter: line 12 column 3 (char 226))

Extension error (pydata_sphinx_theme):
Handler <function update_config at 0x7f511f19b600> for event 'builder-inited' threw an exception (exception: Expecting ',' delimiter: line 12 column 3 (char 226))
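
The `Expecting ',' delimiter` message can be reproduced locally with the standard library, which helps locate the offending position before pushing. The content of `broken` below is hypothetical and only mimics a list of entries with a missing comma; it is not the actual switcher.json.

```python
import json

# Hypothetical switcher-like content with a missing comma between two entries.
broken = '[{"name": "latest", "version": "latest"} {"name": "stable", "version": "stable"}]'


def check_json(text):
    """Return None if text parses as JSON, else a description of the error location."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"
    return None


print(check_json(broken))

fixed = broken.replace("} {", "}, {")
print(check_json(fixed))  # None
```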

[ENH] energy for multivariate probability distributions with arbitrary k-norm

Currently, the energy method, when called on multivariate distributions (multiple columns), implements the 1-norm energy, which is not strictly proper: it is just the sum of marginal energies, and is therefore minimized (non-uniquely!) by any multivariate distribution that has the correct marginals. This is due to the default handling of multivariate outputs, which is column averaging or summation.

k-norms with k>1 are strictly proper afaik, but they do not fit the current interface which assumes column means/sums.

We may have to add a param to the energy function, or even a new method for multivariate energy - not sure what the best is design-wise.

The key issue is that the 1-norm and 2-norm energies often have closed form solutions or at least known ones that are efficient to compute, whereas other k may or may not have these.

For the extender contract and tag inspection, it means we must be able to cope with the situation where we may want to implement efficient special cases and leave the other cases to approximate routines.
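
As an illustration of the approximate route, here is a Monte Carlo sketch of the self-energy E||X - Y||_k for general k, together with a check that the k=1 case collapses to the sum of marginal energies. `mc_energy` is a hypothetical helper, not part of skpro, and no claim is made about the sign or normalization conventions of the existing energy method.

```python
import numpy as np


def mc_energy(samples_x, samples_y, k=2.0):
    """Monte Carlo estimate of E||X - Y||_k from paired rows of two sample arrays."""
    diff = samples_x - samples_y
    return float(np.mean(np.linalg.norm(diff, ord=k, axis=1)))


rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
# Two independent sample sets from the same correlated bivariate normal.
x = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
y = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

# 1-norm energy equals the sum of per-column energies, so it only sees marginals.
energy_1 = mc_energy(x, y, k=1)
marginal_sum = sum(
    float(np.mean(np.abs(x[:, j] - y[:, j]))) for j in range(x.shape[1])
)

# 2-norm energy depends on the joint distribution through the row-wise norm.
energy_2 = mc_energy(x, y, k=2)
print(energy_1, marginal_sum, energy_2)
```

A design along these lines would keep efficient closed forms for special cases and fall back to sampling for other k.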

Any good ideas? (@Alex-JG3, @frthjf)

[ENH] adapter for `scipy` distributions

Adapting scipy distributions is very formulaic and could easily be dealt with by an adapter class. This also avoids duplication in implementation efforts, as scipy is a core dependency.

This may require some abstraction around methods, but it seems mostly like a delegator class, as scipy does its own broadcasting.

Further discrepancies to be mindful of:

  • energy and similar integrals are not implemented in scipy
  • the class parameterization is different, scipy uses class methods whereas skpro uses __init__ based parameterization
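
A minimal sketch of the delegator idea, assuming scipy is installed. `ScipyDelegator` is a hypothetical name and this is not the adapter that ended up in skpro; it only shows `__init__`-based parameters being forwarded to a frozen scipy distribution, with scipy handling the broadcasting.

```python
import numpy as np
from scipy import stats


class ScipyDelegator:
    """Hypothetical delegator: __init__-based parameters, methods forwarded to scipy."""

    def __init__(self, scipy_dist, **params):
        self.scipy_dist = scipy_dist
        self.params = params
        # Freeze the scipy distribution once; scipy broadcasts array-valued parameters.
        self._frozen = scipy_dist(**params)

    def pdf(self, x):
        return self._frozen.pdf(x)

    def cdf(self, x):
        return self._frozen.cdf(x)

    def ppf(self, p):
        return self._frozen.ppf(p)

    def mean(self):
        return self._frozen.mean()


d = ScipyDelegator(stats.norm, loc=np.array([0.0, 1.0]), scale=2.0)
print(d.cdf(np.array([0.0, 1.0])))  # 0.5 at each location parameter
```

Methods such as energy, which scipy does not provide, would still need a separate implementation or an approximation.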

Good first issue with a design/architecture flavour, can be leveraged to cover a lot of ground in #22.

[ENH] conda release

skpro is not yet available on conda, but should be.

Before we move some proba functionality from sktime into skpro, it must be available on conda.

[ENH] transformed distribution

The package should contain a transformed distribution, after applying a fitted sklearn transformer's transform or inverse_transform.

As these can be arbitrary, it will probably be necessary to estimate most distribution methods by Monte Carlo.
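
A rough sketch of the Monte Carlo approach, using only numpy. `MCTransformed` is a hypothetical class, and a plain callable stands in for a fitted transformer's `transform` or `inverse_transform`.

```python
import numpy as np


class MCTransformed:
    """Hypothetical Monte Carlo transformed distribution; not skpro's implementation."""

    def __init__(self, sampler, transform, n_samples=10_000, seed=0):
        rng = np.random.default_rng(seed)
        # Draw from the base distribution and push the draws through the transform.
        self.draws = transform(sampler(rng, n_samples))

    def mean(self):
        return float(self.draws.mean())

    def cdf(self, x):
        # Empirical CDF of the transformed draws.
        return float(np.mean(self.draws <= x))

    def ppf(self, p):
        return float(np.quantile(self.draws, p))


# Base: Normal(0, 1); transform: exp, standing in for a fitted inverse_transform.
d = MCTransformed(lambda rng, n: rng.normal(0.0, 1.0, size=n), np.exp)
print(d.cdf(1.0))  # close to 0.5, since exp(0) = 1 is the median
```

For monotone transforms, cdf and ppf could instead be computed exactly from the base distribution; the sampling route is the fallback for arbitrary transformers.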

This will be useful in a probabilistic version of TransformedTargetRegressor.

[BUG] CDF for empirical distribution returns dataframe of objects not floats.

Describe the bug

The cdf function for the Empirical distribution returns dataframes whose cells are of data type object.

To Reproduce

This example is taken from the get_test_params for the Empirical distribution.

>>> from skpro.distributions.empirical import Empirical
>>> import pandas as pd
>>> spl_idx = pd.MultiIndex.from_product(
...     [[0, 1], [0, 1, 2]], names=["sample", "time"]
... )
>>> spl = pd.DataFrame(
...     [[0, 1], [2, 3], [10, 11], [6, 7], [8, 9], [4, 5]],
...     index=spl_idx,
...     columns=["a", "b"],
... )
>>> spl.values.dtype
dtype('int64')
>>> x = pd.DataFrame(
...     [[0, 1], [2, 3], [10, 11]],
...     index=pd.Index([0, 1, 2]),
...     columns=pd.Index(['a', 'b'])
... )
>>> d = Empirical(spl)
>>> d.cdf(x)
     a    b
0  0.5  0.5
1  0.5  0.5
2  1.0  1.0
>>> d.cdf(x).values.dtype
dtype('O')

This problem arose when I was trying to write a test for checking if the ppf is the inverse of the cdf. Here is an example of what this test might look like and the error I get.

>>> import numpy as np
>>> x
    a   b
0   0   1
1   2   3
2  10  11
>>> d
Empirical(columns=Index(['a', 'b'], dtype='object'),
          index=Index([0, 1, 2], dtype='int64', name='time'),
          spl=              a   b
sample time
0      0      0   1
       1      2   3
       2     10  11
1      0      6   7
       1      8   9
       2      4   5)
>>> cdf = d.cdf(x)
>>> cdf.values.dtype
dtype('O')
>>> x_approx = d.ppf(cdf)
>>> x.values.dtype
dtype('int64')
>>> x_approx.values.dtype
dtype('O')
>>> np.allclose(x_approx, x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 200, in allclose
  File "/home/alex/documents/skpro/.venv/lib/python3.10/site-packages/numpy/core/numeric.py", line 2270, in allclose
    res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
  File "<__array_function__ internals>", line 200, in isclose
  File "/home/alex/documents/skpro/.venv/lib/python3.10/site-packages/numpy/core/numeric.py", line 2377, in isclose
    xfin = isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error occurs because numpy cannot safely coerce the object dtype to a numeric type for the comparison.
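
Until the return dtype is fixed, a test could cast both frames to float before comparing. The sketch below uses pandas-only stand-ins for `x` and for the `ppf(cdf(x))` round trip; `to_float_frame` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def to_float_frame(df):
    """Cast every column to NumPy float64 so numeric comparisons work."""
    return df.astype("float64")


# Stand-ins for x and the round-trip result; the object dtype mimics the report.
x = pd.DataFrame([[0, 1], [2, 3], [10, 11]], columns=["a", "b"])
x_approx = x.astype(object)

print(x_approx.values.dtype)  # object
print(np.allclose(to_float_frame(x_approx), to_float_frame(x)))  # True
```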

Expected behavior

The dataframe returned from cdf should contain floats, not objects.

Environment

pip freeze.

anyio==3.7.1
backoff==2.2.1
certifi==2023.7.22
cfgv==3.4.0
coverage==7.3.0
distlib==0.3.7
exceptiongroup==1.1.3
execnet==2.0.2
filelock==3.12.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
identify==2.5.27
idna==3.4
iniconfig==2.0.0
joblib==1.3.2
nodeenv==1.8.0
numpy==1.24.4
packaging==23.1
pandas==2.0.3
platformdirs==3.10.0
pluggy==1.3.0
pre-commit==3.3.3
pytest==7.4.0
pytest-cov==4.1.0
pytest-randomly==3.15.0
pytest-timeout==2.1.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
scikit-base==0.5.1
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
skpro @ file:///home/alex/documents/skpro
sniffio==1.3.0
threadpoolctl==3.2.0
tomli==2.0.1
tzdata==2023.3
virtualenv==20.24.3

OS:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Additional context

[BUG] sphinx build failures with json parse error

The sphinx doc build seems to fail with json parse errors.

I don't think anything was changed that would cause this, it's a bit mysterious.

My suspicion is, based on it being the only json file in the docs afaik, the file docs/source/_static/switcher.json, but all seems ok there?

FYI @yarnabrina, @duydl, in case you have any quick spots.

[BUG] cyclic boosting - sporadic test failures due to convergence failure

The recently added cyclic boosting estimator sporadically fails tests due to failed convergence of the loss, e.g.,:

FAILED skpro/regression/tests/test_cyclic_boosting.py::test_cyclic_boosting_with_manual_paramaters - cyclic_boosting.utils.ConvergenceError: Your cyclic boosting training seems to be diverging. In the 9. iteration the current loss: 52.52700124396056, is greater than the trivial loss with just mean predictions: 20.816666666666666.

FYI @setoguchi-naoki, @FelixWick
