sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in python

Home Page: https://skpro.readthedocs.io/en/latest

License: BSD 3-Clause "New" or "Revised" License

Python 99.74% Makefile 0.16% Shell 0.05% Dockerfile 0.06%

python sklearn data-science framework prediction machine-learning ai probabilistic-models probability-distributions regression

skpro's Introduction

🚀 Version 2.4.1 out now! Read the release notes here.

skpro is a library for supervised probabilistic prediction in python. It provides scikit-learn-like, scikit-base compatible interfaces to:

tabular supervised regressors for probabilistic prediction - interval, quantile and distribution predictions
tabular probabilistic time-to-event and survival prediction - instance-individual survival distributions
metrics to evaluate probabilistic predictions, e.g., pinball loss, empirical coverage, CRPS, survival losses
reductions to turn scikit-learn regressors into probabilistic skpro regressors, such as bootstrap or conformal
building pipelines and composite models, including tuning via probabilistic performance metrics
symbolic probability distributions with value domain of pandas.DataFrame-s and pandas-like interface

📚 Documentation

Documentation
⭐ Tutorials	New to skpro? Here's everything you need to know!
📋 Binder Notebooks	Example notebooks to play with in your browser.
👩‍💻 User Guides	How to use skpro and its features.
✂️ Extension Templates	How to build your own estimator using skpro's API.
🎛️ API Reference	The detailed reference for skpro's API.
🛠️ Changelog	Changes and version history.
🌳 Roadmap	skpro's software and community development plan.
📝 Related Software	A list of related software.

💬 Where to ask questions

Questions and feedback are extremely welcome! We strongly believe in the value of sharing help publicly, as it allows a wider audience to benefit from it.

skpro is maintained by the sktime community, and we use the same social channels.

Type	Platforms
🐛 Bug Reports	GitHub Issue Tracker
✨ Feature Requests & Ideas	GitHub Issue Tracker
👩‍💻 Usage Questions	GitHub Discussions · Stack Overflow
💬 General Discussion	GitHub Discussions
🏭 Contribution & Development	`dev-chat` channel · Discord
🌐 Community collaboration session	Discord - Fridays 13 UTC, dev/meet-ups channel

💫 Features

Our objective is to enhance the interoperability and usability of the AI model ecosystem:

skpro is compatible with scikit-learn and sktime, e.g., an sktime proba forecaster can be built with an skpro proba regressor, which is an sklearn regressor with proba mode added by skpro
skpro provides a mini-package management framework for first-party implementations, and for interfacing popular second- and third-party components, such as cyclic-boosting or MAPIE packages.

skpro curates libraries of components of the following types:

Module	Status	Links
Probabilistic tabular regression	maturing	Tutorial · API Reference · Extension Template
Time-to-event (survival) prediction	maturing	API Reference · Extension Template
Performance metrics	maturing	API Reference
Probability distributions	maturing	Tutorial · API Reference · Extension Template

⏳ Installing `skpro`

To install skpro, use pip:

pip install skpro

or, with maximum dependencies,

pip install skpro[all_extras]

Releases are available as source packages and binary wheels. You can see all available wheels here.

⚡ Quickstart

Making probabilistic predictions

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from skpro.regression.residual import ResidualDouble

# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
# keep the held-out targets as y_test, they are needed for evaluation in step 6
X_train, X_new, y_train, y_test = train_test_split(X, y)

# step 2: specifying the regressor - any compatible regressor is valid!
# example - "squaring residuals" regressor
# random forest for mean prediction
# linear regression for variance prediction
reg_mean = RandomForestRegressor()
reg_resid = LinearRegression()
reg_proba = ResidualDouble(reg_mean, reg_resid)

# step 3: fitting the model to training data
reg_proba.fit(X_train, y_train)

# step 4: predicting labels on new data

# probabilistic prediction modes - pick any or multiple

# full distribution prediction
y_pred_proba = reg_proba.predict_proba(X_new)

# interval prediction
y_pred_interval = reg_proba.predict_interval(X_new, coverage=0.9)

# quantile prediction
y_pred_quantiles = reg_proba.predict_quantiles(X_new, alpha=[0.05, 0.5, 0.95])

# variance prediction
y_pred_var = reg_proba.predict_var(X_new)

# mean prediction is same as "classical" sklearn predict, also available
y_pred_mean = reg_proba.predict(X_new)

Evaluating predictions

# step 5: specifying evaluation metric
from skpro.metrics import CRPS

metric = CRPS()  # continuous ranked probability score - any skpro metric works!

# step 6: evaluate metric, compare predictions to actuals
metric(y_test, y_pred_proba)
>>> 32.19

👋 How to get involved

There are many ways to get involved with development of skpro, which is developed by the sktime community. We follow the all-contributors specification: all kinds of contributions are welcome - not just code.

Documentation
💝 Contribute	How to contribute to skpro.
🎒 Mentoring	New to open source? Apply to our mentoring program!
📅 Meetings	Join our discussions, tutorials, workshops, and sprints!
👩‍🔧 Developer Guides	How to further develop the skpro code base.
🏅 Contributors	A list of all contributors.
🙋 Roles	An overview of our core community roles.
💸 Donate	Fund sktime and skpro maintenance and development.
🏛️ Governance	How and by whom decisions are made in sktime's community.

👋 Citation

To cite skpro in a scientific publication, see citations.

skpro's Issues

[ENH] interface `XGBoostLSS` et al by `StatMixedML`

It would be great to interface the various probabilistic supervised regressors of StatMixedML, so they can then immediately be used for forecasting in sktime via skpro!

XGBoostLSS https://github.com/StatMixedML/XGBoostLSS
LightGBMLSS https://github.com/StatMixedML/LightGBMLSS
CatBoostLSS https://github.com/StatMixedML/CatBoostLSS
pyboostLSS https://github.com/StatMixedML/Py-BoostLSS

FYI @StatMixedML, @joshdunnlime

Many thanks to @KiwiAthlete for the suggestion!

[BUG] check missing test coverage for `BaseDistribution.pdfnorm` fallback

As per #204, the fallback default for BaseDistribution.pdfnorm seems not to have been tested; it should be checked why.

[ENH] MAPIE interface

MAPIE is one of the more popular "favourite algorithm" repositories in the probabilistic supervised learning space.

Their scope is more geared towards multiple tasks (incl classification, time series forecasting, etc), whereas skpro has a more stringent framework layer imo.

The specific estimator that would be great to interface is the MapieRegressor:
https://github.com/scikit-learn-contrib/MAPIE/blob/master/mapie/regression/regression.py

should be straightforward with the extension template.
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py

FYI @gmartinonQM, @thibaultcordier, @LacombeLouis, @vincentblot28, @vtaquet, @SZiane in case there might be interest in close collaboration - I see you are planning to add further models, and you are also following closely sklearn as a design template, so perhaps we should merge base class architectures?

For context, skpro is a 2017 package for proba supervised (tabular) regression which is one of the design antecessors of sktime. Parts of it became the sktime probabilistic forecasting module, now it has been rearchitected using skbase. Of course a number of proba supervised regression packages have seen the light of day since 2017.

sktime and skpro adopt a framework integration philosophy where third party packages can be easily interfaced and used as components in pipelines etc, we manage dependencies on the level of estimators.
https://www.sktime.net/en/stable/

For instance, skpro probabilistic tabular regressors can be used in sktime probabilistic forecasting pipelines.

[ENH] simple pipeline compatible with `sklearn`

The package should include a simple linear pipeline that can make use of sklearn transformers, which gives for free feature union, variable subsetting, etc.

[ENH] multiple quantile regression

Is your feature request related to a problem? Please describe.

For quantile regression, often more than one quantile probability is of interest. However, existing Sklearn compatible quantile regressors always fit and predict a single quantile probability. To the best of my knowledge, there is no standardized way to integrate multiple quantile regression with Sktime/Skpro probabilistic prediction methods such as predict_quantiles/predict_interval.

Describe the solution you'd like

New Skpro regressor that wraps multiple quantile regressors and supports probabilistic predictions from wrapped regressors.

Describe alternatives you've considered

None, as the proposed solution is already discussed in Sktime issue sktime/sktime#5357
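A minimal sketch of the wrapper idea described above, assuming one independently fitted single-quantile regressor per requested probability. `EmpiricalQuantileRegressor` and `MultipleQuantileSketch` are hypothetical names used for illustration only; neither is part of skpro or sktime, and the toy base regressor ignores the features entirely.

```python
import math


class EmpiricalQuantileRegressor:
    """Toy single-quantile regressor: ignores X, predicts a training quantile."""

    def __init__(self, alpha):
        self.alpha = alpha

    def fit(self, X, y):
        ys = sorted(y)
        # position of the order statistic used as the alpha-quantile
        idx = min(len(ys) - 1, max(0, math.ceil(self.alpha * len(ys)) - 1))
        self.q_ = ys[idx]
        return self

    def predict(self, X):
        return [self.q_ for _ in X]


class MultipleQuantileSketch:
    """Fits one single-quantile regressor per requested quantile probability."""

    def __init__(self, make_regressor, alphas):
        self.make_regressor = make_regressor  # callable: alpha -> unfitted regressor
        self.alphas = list(alphas)

    def fit(self, X, y):
        self.regressors_ = {a: self.make_regressor(a).fit(X, y) for a in self.alphas}
        return self

    def predict_quantiles(self, X, alpha=None):
        # only probabilities seen at construction time can be requested
        alphas = self.alphas if alpha is None else alpha
        return {a: self.regressors_[a].predict(X) for a in alphas}


X = [[i] for i in range(10)]
y = [float(i) for i in range(1, 11)]
model = MultipleQuantileSketch(EmpiricalQuantileRegressor, [0.1, 0.5, 0.9]).fit(X, y)
preds = model.predict_quantiles([[0], [1]])
```

Because each quantile is fitted independently, nothing in this sketch prevents the predicted quantiles from crossing when real, feature-dependent regressors are plugged in.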

[ENH] explicit/analytic form of energy function for chi-squared distribution

There does not seem to exist a literature reference for the energy functionals of the chi-squared distribution, we should try to derive it, or find a reference.

Collecting discussion below, from #22.

Current state:

explicit formula for the cross-term $\mathbb{E}[|X-y|]$ derived, but not posted yet
no progress yet on the self-term $\mathbb{E}[|X-X'|]$

@sukjingitsit has been working on this.

Would you like to post your partial progress?
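Until closed forms are available, both energy terms can be approximated by simulation, which also gives a reference to check a derived formula against. The following is a sketch only, using the fact that a chi-squared distribution with `df` degrees of freedom is a Gamma distribution with shape `df / 2` and scale 2. The function name is made up for this example.

```python
import random


def chi2_energy_monte_carlo(df, y, n=100_000, seed=0):
    """Monte Carlo estimates of E|X - y| and E|X - X'| for X, X' ~ chi2(df)."""
    rng = random.Random(seed)
    # chi-squared(df) equals Gamma(shape=df/2, scale=2)
    xs = [rng.gammavariate(df / 2, 2.0) for _ in range(n)]
    xs2 = [rng.gammavariate(df / 2, 2.0) for _ in range(n)]
    cross = sum(abs(x - y) for x in xs) / n
    self_term = sum(abs(a - b) for a, b in zip(xs, xs2)) / n
    return cross, self_term


# sanity case: chi2(2) is exponential with mean 2, so both terms equal 2 at y = 0
cross, self_term = chi2_energy_monte_carlo(df=2, y=0.0)
```

The `df=2` case is a convenient check because the difference of two independent exponentials is Laplace distributed, which makes the self-term known exactly.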

Distributions as return objects

Re-opening the sub-issue opened in #3 and commented upon by @murphyk

Question: should skpro's predict methods return a vector of distribution objects?
For example, using the distributions from scipy.stats which implement methods pdf, cdf, mean, var, etc.

Pro:

this would be using an existing, consolidated, and well-supported interface
it might be easier to use
it might be easier to understand

Contra:

mixture types are not supported
l2 norm is not supported (as would be needed for squared/Gneiting loss)
mixed distributions on the reals, especially empirical distributions (weighted sum of deltas) which are returned by Bayesian packages are not supported
vectors of distributions are not supported, alternatively Cartesian products of distributions
this is not the status quo

[ENH] Calibration plots for probabilistic predictions

@benHeid made nice calibration plots here: sktime/sktime#5632

So I wonder:

should these also be added in skpro for probabilistic tabular regressors? Same principle works.
any reasons why it could not be shifted via copy/paste and import change?

Mid-term, it might be best to have them only in skpro, and make sktime forecasting rely more on skpro distributions machinery and utilities.

FYI @benHeid, any thoughts?
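For tabular regressors, the data behind such a plot can be sketched from probability integral transform (PIT) values: evaluate each predictive cdf at its observation and compare nominal levels against observed frequencies. This is an illustration under that assumption, not the code from sktime/sktime#5632; `calibration_curve` is a hypothetical helper, and the standard library `NormalDist` stands in for skpro distribution objects.

```python
import random
from statistics import NormalDist


def calibration_curve(pred_dists, y_true, levels):
    """Return (nominal level, observed frequency) pairs computed from PIT values."""
    pit = [d.cdf(y) for d, y in zip(pred_dists, y_true)]
    n = len(pit)
    return [(p, sum(u <= p for u in pit) / n) for p in levels]


# synthetic, perfectly specified forecasts: the curve should follow the diagonal
rng = random.Random(0)
means = [rng.uniform(-5, 5) for _ in range(5000)]
y_true = [rng.gauss(m, 1.0) for m in means]
pred_dists = [NormalDist(m, 1.0) for m in means]
curve = calibration_curve(pred_dists, y_true, [0.1, 0.25, 0.5, 0.75, 0.9])
```

Plotting the pairs against the diagonal gives the calibration plot; systematic deviation above or below the diagonal indicates miscalibrated predictive distributions.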

[ENH] further improvements on `BaseDistribution` and tests

From implementing a few distributions, some insights on potential improvements:

the broadcasting logic could/should be abstracted to apply to a certain subset of parameters rather than be copy-pasted in every class
there is repeated boilerplate, this could be resolved via split into public/private methods such as pdf/_pdf, similar to sktime's fit/_fit
we should add logical consistency tests, e.g., cdf being inverse to ppf, or pdf and log_pdf being compatible, or more sophisticated tests such as subtracting the mean shifting functions (or not affecting them) accordingly
worth thinking about: logical consistency tests of MC approximations against exact implementation, although that can eat a lot of runtime (so perhaps not?)
some distributions, such as empirical or composites, need careful thinking about the subsetting logic - so an extension contract for iloc and loc indexing needs to be thought out so it can affect parameters
row/column subsetting via loc, iloc, should be tested
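The public/private split and the cdf/ppf consistency test from the list above could look roughly as follows. This is a sketch with hypothetical class names; it uses plain lists rather than pandas tables, so it does not cover broadcasting or loc/iloc subsetting.

```python
from statistics import NormalDist


class BaseDistributionSketch:
    """Hypothetical base class: public methods validate, private ones compute."""

    def pdf(self, x):
        return [self._pdf(v) for v in self._coerce(x)]

    def cdf(self, x):
        return [self._cdf(v) for v in self._coerce(x)]

    def ppf(self, p):
        p = self._coerce(p)
        if any(not 0 < v < 1 for v in p):
            raise ValueError("probabilities must lie strictly between 0 and 1")
        return [self._ppf(v) for v in p]

    @staticmethod
    def _coerce(x):
        # boilerplate lives here once, instead of in every concrete class
        return [float(v) for v in x] if isinstance(x, (list, tuple)) else [float(x)]


class NormalSketch(BaseDistributionSketch):
    """Concrete distribution: only the private scalar methods are implemented."""

    def __init__(self, mu, sigma):
        self._dist = NormalDist(mu, sigma)

    def _pdf(self, x):
        return self._dist.pdf(x)

    def _cdf(self, x):
        return self._dist.cdf(x)

    def _ppf(self, p):
        return self._dist.inv_cdf(p)


def check_ppf_inverts_cdf(dist, xs, tol=1e-6):
    """Logical consistency test: ppf(cdf(x)) should return x."""
    return all(abs(a - b) < tol for a, b in zip(dist.ppf(dist.cdf(xs)), xs))


ok = check_ppf_inverts_cdf(NormalSketch(1.0, 2.0), [-2.0, 0.0, 1.0, 4.5])
```

The same check function can be run against every concrete distribution in a test suite, since it only relies on the public interface.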

[ENH] interface to cyclic boosting package

It would be nice to interface cyclic_boosting, which provides implementations of the cyclic boosting regressors.
Blue-Yonder-OSS/cyclic-boosting#56

They are popular in proba forecasting, but by nature are actually proba supervised regression algorithms so "belong" in skpro, from where they can be interfaced by sktime for forecasting.

The interface to implement is in this extension template:
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py

FYI @rijkvandermeulen, @FelixWick

[ENH] support for row multi-index in distributions

Distributions should be able to support a row multi-index, i.e., pd.MultiIndex - as in sktime hierarchical and panel data.

This is probably automatic for most distributions, but not immediate for, e.g., Empirical, where parameterization is in terms of additional index levels.

In addition, testing should be carried out with multi-index examples in the general suite.

Related: PR in sktime which adds multiindex support to Empirical
sktime/sktime#6066

[ENH] adding truncated means as an interface point?

API design and math question: would it make sense to add truncated means as general interface points? Or, more generally, truncated moments?

API design-wise, having it would help with:

computation of energy, one of the terms is explicit in cdf and truncated means, see here: #22 (comment)
truncation compositor, which then would have access to more analytical formulae

questions:

if we add it, how? I would add additional arguments to the existing mean method.

Math-wise, the questions are:

how easy are truncated mean and moments to obtain - matching against current literature coverage, and "potential" coverage in terms of analytic forms
what are the analytic relations between these, and other distribution defining functions or methods currently present?
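One analytic relation that can be stated with certainty: for any distribution with finite mean, the energy cross-term follows from the cdf and the partial (lower truncated) mean, via E|X - y| = y (2 F(y) - 1) + E[X] - 2 E[X 1(X <= y)]. The truncated mean E[X | X <= y] is the partial mean divided by F(y). The sketch below checks this for the Gaussian case; the function names are hypothetical and not a proposal for the final interface.

```python
from statistics import NormalDist


def normal_partial_mean(mu, sigma, y):
    """E[X * 1(X <= y)] for X ~ Normal(mu, sigma)."""
    z = (y - mu) / sigma
    std = NormalDist()
    return mu * std.cdf(z) - sigma * std.pdf(z)


def cross_term(y, cdf_at_y, partial_mean_at_y, mean):
    """E|X - y| = y * (2 F(y) - 1) + E[X] - 2 * E[X * 1(X <= y)]."""
    return y * (2 * cdf_at_y - 1) + mean - 2 * partial_mean_at_y


mu, sigma, y = 1.0, 2.0, 3.0
value = cross_term(y, NormalDist(mu, sigma).cdf(y), normal_partial_mean(mu, sigma, y), mu)

# known Gaussian closed form, for comparison
z = (y - mu) / sigma
closed_form = (y - mu) * (2 * NormalDist().cdf(z) - 1) + 2 * sigma * NormalDist().pdf(z)
```

`cross_term` itself is distribution-agnostic, which is the argument for exposing truncated or partial means as an interface point: any distribution that provides them gets the cross-term for free.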

umbrella issue - `scikit-base` based rearchitecture for `skpro` v2

1st part: rearchitecting on scikit-base #9 -
this unlocks creating new estimators, distributions, and moving legacy estimators to scikit-base compatible interface
2nd part to cover probabilistic metrics and example notebooks
this unlocks more metrics and moving legacy metrics #11
3rd part - predict_interval, predict_quantiles etc, as in sktime #18

Moving individual estimators from legacy to scikit-base interface

complete missing features in residual two-step
bagging estimator
density/distribution estimator naive baseline (should include kernel and empirical)
composite parametric
bounding wrapper
grid and random tuning (requires metrics)

Moving distributions from legacy to scikit-base interface

Moving metrics from legacy to scikit-base interface - can be reused from sktime

CRPS
log-loss
capped log-loss
squared loss

Related: old refactor effort (superseded)
#6

[ENH] explicit/analytic form of energy function for log-normal distribution

There does not seem to exist a literature reference for the energy functionals of the log-normal distribution, we should try to derive it, or find a reference.

Collecting discussion below, from #214.

Current state:

explicit formula for the cross-term $\mathbb{E}[|X-c|]$ (almost) derived
no progress yet on the self-term $\mathbb{E}[|X-X'|]$

@bhavikar04 used Wolfram Alpha to derive the following indefinite integral related to the cross-term $\mathbb{E}[|X-c|]$:

My reply:
this looks correct. Now you need to add the limits. That should be an easy substitution, no? I recommend, do that manually. Use that

$\lim_{x\rightarrow -\infty} \mbox{erf}(x) = -1$, and $\lim_{x\rightarrow \infty} \mbox{erf}(x) = 1$. You need to be careful with the sign, but that should be it?

The number 0.707 etc should be $\frac{1}{2} \sqrt{2}$, but it doesn't matter for the limits.
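A candidate closed form for the cross-term, offered as a sketch to check the derivation in this thread against rather than as its confirmed result. It combines the identity E|X - c| = c (2 F(c) - 1) + E[X] - 2 E[X 1(X <= c)] with the standard log-normal partial expectation E[X 1(X <= c)] = exp(mu + sigma^2 / 2) Phi((ln c - mu) / sigma - sigma). The erf expressions above relate to Phi through Phi(x) = (1 + erf(x / sqrt(2))) / 2. Function names are made up for this example.

```python
import math
import random
from statistics import NormalDist


def lognormal_cross_term(mu, sigma, c):
    """Candidate closed form of E|X - c| for log(X) ~ Normal(mu, sigma), c > 0."""
    std = NormalDist()
    d0 = (math.log(c) - mu) / sigma
    mean = math.exp(mu + 0.5 * sigma**2)
    # partial expectation E[X * 1(X <= c)] = mean * Phi(d0 - sigma)
    return c * (2 * std.cdf(d0) - 1) + mean * (1 - 2 * std.cdf(d0 - sigma))


def lognormal_cross_term_mc(mu, sigma, c, n=200_000, seed=0):
    """Monte Carlo reference value for the same quantity."""
    rng = random.Random(seed)
    return sum(abs(rng.lognormvariate(mu, sigma) - c) for _ in range(n)) / n


exact = lognormal_cross_term(0.0, 0.5, 1.2)
approx = lognormal_cross_term_mc(0.0, 0.5, 1.2)
```

The limits behave as expected: for c near 0 the expression tends to the mean, and for large c it tends to c minus the mean.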

First example: 'utils' not found

The first example in your documentation (DensityBaseline) does not run right on my machine: it throws a 'module not found' exception at the call to 'utils'.

This might be a python version problem (I am using 3.6), so perhaps it's not an error in the normal sense - though I don't see any specification that the package required a particular python version. Apologies if I missed it: in any case, I fixed it by importing matplotlib instead: i.e.

import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)

instead of:

import utils
utils.plot_performance(y_test, y_pred)

[ENH] probabilistic `TransformedTargetRegressor`

The package should contain a TransformedTargetRegressor that is applicable to probabilistic regressors.

Most likely this would make use of #30.

For some distribution/transform combinations, there might be better dispatchable transformed distributions.
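As an illustration of why some combinations are easy to handle: quantiles commute with increasing transformations, so quantile predictions of the inner regressor can be back-transformed directly, whereas means cannot. The sketch below assumes a log-transformed target and a Gaussian prediction on the log scale; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def predict_quantiles_log_target(mu_log, sigma_log, alphas):
    """Quantiles of y when log(y) is predicted as Normal(mu_log, sigma_log)."""
    inner = NormalDist(mu_log, sigma_log)
    # for an increasing back-transform g: q_y(alpha) = g(q_inner(alpha))
    return {a: math.exp(inner.inv_cdf(a)) for a in alphas}


q = predict_quantiles_log_target(mu_log=1.0, sigma_log=0.5, alphas=[0.05, 0.5, 0.95])

# the mean does not commute: E[y] = exp(mu + sigma^2 / 2), not exp(mu)
mean_y = math.exp(1.0 + 0.5 * 0.5**2)
```

In this example the transformed distribution is log-normal, which is the kind of combination where dispatching to a known distribution family would beat a generic transformed-distribution wrapper.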

[BUG] incorrectly low coverage reports

Coverage reports seem to give low percentages - even in cases where it should be clear that the estimator is tested.

Perhaps the coverage does not recognize the lines due to the incremental and matrix testing?

[ENH] quantile crossing handling in multiple quantile regression

We should think how to handle quantile crossing in multiple quantile regression.

Original post of @Ram0nB:

Good one. We could provide the user a "method" parameter similar to Sktime's Imputer, provide the user a quantile_crossing_callable similar to Sktime's FunctionTransformer, or both. What are your thoughts on this @fkiraly ? Maybe we can open an enhancement issue for this for now?

Originally posted by @Ram0nB in #107 (comment)
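A sketch of what the "method" parameter plus callable option could look like, assuming predictions arrive as one list of quantile values per instance, ordered by increasing quantile probability. The function and the method names `"sort"` and `"cummax"` are hypothetical choices for illustration.

```python
def fix_quantile_crossing(quantile_rows, method="sort"):
    """Enforce non-decreasing quantile predictions within each row."""
    if method == "sort":
        # rearrangement: reorder the predicted values
        return [sorted(row) for row in quantile_rows]
    if method == "cummax":
        # running maximum: never let a higher quantile fall below a lower one
        fixed = []
        for row in quantile_rows:
            out, running = [], float("-inf")
            for value in row:
                running = max(running, value)
                out.append(running)
            fixed.append(out)
        return fixed
    if callable(method):
        # user-supplied strategy, applied row by row
        return [list(method(row)) for row in quantile_rows]
    raise ValueError(f"unknown method: {method!r}")


crossed = [[1.0, 3.0, 2.0], [0.5, 0.4, 0.9]]
by_sort = fix_quantile_crossing(crossed)
by_cummax = fix_quantile_crossing(crossed, method="cummax")
by_callable = fix_quantile_crossing(crossed, method=lambda row: sorted(row))
```

Sorting keeps the set of predicted values per row, while the running maximum keeps the lower quantiles fixed and lifts the offending ones, so the two strategies can disagree on which prediction to trust.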

[ENH] refactor - move `utils.git_diff` to `skbase`

Currently, utils.git_diff is a copy of the one in sktime, which is not DRY.

Both instances of the git_diff module should move to skbase, and be imported from there.

Possibly also including the fixture generator for the test classes.

After removal in the original packages, the skbase lower bound should be raised to guarantee presence of the import.

[DOC] extension template for probability distributions

We should write an extension template for probability distributions - possibly we may like to do this after some of the improvements in #21
which will change the extender contract slightly.

This should look similar to the sktime extension templates:
https://github.com/sktime/sktime/tree/main/extension_templates

From @Alex-JG3 in #21, it is important to highlight the interplay between the tags (what is exact/approximate?) and the option to not implement certain methods: There are four cases:

method must be implemented, or one out of k methods must be implemented
method need not be implemented as there is a default, but can be, to replace a less accurate or less efficient default

combined with:

method results in numerically exact return
method return is approximate

[BUG] differential/incremental testing of estimators does not work

It seems the incremental testing does not properly work in skpro, unlike in sktime.

I don't understand why it doesn't. This leads to test times of 7min instead of 2min for individual workflow elements.

Failed attempts to fix:
#127

[ENH] plotting of `BaseDistribution` objects

It would be neat if there is a simple yet powerful visualization routine for BaseDistribution objects. This is not easy to come up with, as these objects take values in pandas-like tables, i.e., there is a univariate distribution marginal per table cell.

Some discussion around it in the ridgeplot repo: tpvasconcelos/ridgeplot#173, tpvasconcelos/ridgeplot#171
opening this issue to link and track.

[ENH] interface probabilistic regressors from `ngboost` package

Update by @fkiraly - we should interface ngboost regressors as probabilistic supervised (tabular) regressors.
As discussed in the below, this should be an skpro regressor, which in turn can be used in sktime reduction forecasters (such as YfromX). The original request was for using ngboost as a forecaster in sktime, but this is a tabular proba regressor that needs to go through an additional reduction step (which is now implemented in sktime).

It should be straightforward to interface ngboost using the probabilistic regressor extension template:
https://github.com/sktime/skpro/blob/main/extension_templates/regression.py
so adding it as a good first issue.

The main technical concern might be translating the ngboost probability distributions into skpro probability distributions, but that should also be addressable with a lean adaptation layer (personally, I would add that adaptation in an adapter utility subpackage in regression.ngboost).

Original request below.

Is your feature request related to a problem? Please describe.
Can we build a Probabilistic Forecaster using Ngboost? A probabilistic regression method like Ngboost will give us the confidence intervals out of the box, unlike others where we need a heuristic (like quantile loss) to get the values of Conf/Pred Ints.

Describe the solution you'd like
A rough sketch:

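import scipy.stats

# rough sketch only: NgboostRegressor and make_reduction are placeholders
class ProbForecaster:
    def __init__(self):
        self.estimator_ = NgboostRegressor()

    def fit(self, y, X=None):
        # reduce the series to a tabular problem and remember it for predict
        self.y_ = make_reduction(y)
        self.estimator_.fit(self.y_)

    def predict(self, fh, X=None, alpha=0.95, conf_int=True):
        mean_preds = self.estimator_.predict(self.y_)
        distribution = self.estimator_.pred_dist(self.y_)
        # Gaussian interval from predicted location and scale
        conf_ints = scipy.stats.norm.interval(alpha, distribution.loc, distribution.scale)
        return mean_preds, conf_ints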

Describe alternatives you've considered
Using Xgboost/LightGBM with quantile loss makes the calculation of prediction intervals inefficient.

Additional context
Ngboost Doc- https://stanfordmlgroup.github.io/projects/ngboost/

[BUG] QPD distribution and `CyclicBoosting` API non-compliance

@setoguchi-naoki, @FelixWick, I have to apologize in advance for this.

Due to an error in the test logic introduced in an update to class retrieval (see #189), probability distributions went uncovered for most of your PR's duration.

This is my fault for not noticing, and breaking it in the first place.

Bad news is that the QPD distribution has a few non-compliances:

cdf method
mean, var
subsetting

CyclicBoosting as well:

_predict_quantiles arg should be alpha, not quantiles

Good news is that this has not been released yet, and I was running more tests before the 2.2.0 release, which is what ultimately caught the problem.

I'll simply wait with 2.2.0 until we have had time to fix this - happy to help.

For testing locally, you need to depend on the branch #189 until it is merged.

[ENH] roadmap of probabilistic regressors to implement or to interface

A wishlist for probabilistic regression methods to implement or interface.
This is partly copied from the list I made when designing the R counterpart mlr-org/mlr3proba#32 .
Number of stars at the end is estimated difficulty or time investment.

GLM

generalized linear model(s) with continuous regression link, e.g., Gaussian *
- Gaussian link, statsmodels
- further regression links: Gamma, Tweedie, inverse Gaussian
generalized linear model(s) with count link, e.g., Poisson *
- Poisson link, statsmodels
- Poisson link, sklearn
- further links: Binomial
heteroscedastic linear regression ***
Bayesian GLM where conjugate priors are available, e.g., GLM with Gaussian link ***

KRR aka Gaussian process regression

vanilla kernel ridge regression with fixed kernel parameters and variance *
kernel ridge regression with MLE for kernel parameters and regularization parameter **
heteroscedastic KRR or Gaussian processes ***

CDE

variants of conditional density estimation (Nadaraya-Watson type) **
reduction to density estimation by binning of input variables, then apply unconditional density estimation **

Gradient boosting and tree-based

ngboost package interface *
probabilistic residual boosting **
probabilistic regression trees **

Neural networks

interface tensorflow probability - some hard-coded NN architectures **
generic tensorflow probability interface - some hard-coded NN architectures ***

Bayesian toolboxes

generic pymc3 interface ***
generic pyro interface ****
generic Stan interface ****
generic JAGS interface ****
generic BUGS interface ****
generic Bayesian interface - prior-valued hyperparameters *****

Pipeline elements for target transformation

distr fixed target transformation **
distr predictive target calibration **

Composite techniques, reduction to deterministic regression

stick mean, sd, from a deterministic regressor which already has these as return types into some location/scale distr family (Gaussian, Laplace) *
use model 1 for the mean, model 2 fit to residuals (squared, absolute, or log), put this in some location/scale distr family (Gaussian, Laplace) **
upper/lower thresholder for a regression prediction, to use as a pipeline element for a forced lower variance bound **
generic parameter prediction by elicitation, output being plugged into parameters of a distr object not necessarily scale/location ****
reduction via bootstrapped sampling of a deterministic regressor **

Ensembling type pipeline elements and compositors

simple bagging, averaging of pdf/cdf **
probabilistic boosting ***
probabilistic stacking ***

baselines

always predict a Gaussian with mean = training mean, var = training var *
unconditional densities via distfit package, interface *
IMPORTANT as featureless baseline: reduction to distr/density estimation to produce an unconditional probabilistic regressor **
IMPORTANT as deterministic style baseline: reduction to deterministic regression, mean = prediction by det.regressor, var = training sample var, distr type = Gaussian (or Laplace) **
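The first baseline in the list above, a Gaussian with training mean and training variance, fits in a few lines. This is a standalone sketch with a hypothetical class name, using the standard library `NormalDist` in place of an skpro distribution and the sample variance as the variance estimate.

```python
from statistics import NormalDist, fmean, variance


class GaussianBaselineSketch:
    """Featureless baseline: predicts Normal(training mean, training sd) everywhere."""

    def fit(self, X, y):
        # X is ignored on purpose; variance() is the sample variance (n - 1)
        self.dist_ = NormalDist(fmean(y), variance(y) ** 0.5)
        return self

    def predict_proba(self, X):
        return [self.dist_ for _ in X]

    def predict_quantiles(self, X, alpha):
        return [[self.dist_.inv_cdf(a) for a in alpha] for _ in X]


baseline = GaussianBaselineSketch().fit([[0], [1], [2], [3], [4]], [1.0, 2.0, 3.0, 4.0, 5.0])
medians = baseline.predict_quantiles([[10], [11]], alpha=[0.5])
```

Any probabilistic regressor that cannot beat this baseline on a proper scoring rule is not extracting usable information from the features.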

Other reduction from/to probabilistic regression

reducing deterministic regression to probabilistic regression - take mean, median or mode **
reduction(s) to quantile regression, use predictive quantiles to make a distr ***
reducing deterministic (quantile) regression to probabilistic regression - take quantile(s) **
reducing interval regression to probabilistic regression - take mean/sd, or take quantile(s) **
reduction to survival, as the sub-case of no censoring **
reduction to classification, by binning ***

[ENH] update `evaluate` and tuning to survival models

The evaluation utilities and tuners for probabilistic regression work with survival prediction models introduced in #157, but do not pass C through.

They should be updated - before the release containing #157 - to pass C through to models fitted on resamples.

[ENH] Add Lambert W x F distributions

Is your feature request related to a problem? Please describe.

For modeling skewed and/or heavy-tailed distributions, I'd like to have support for Lambert W x F distributions. On top of modeling, Lambert W x F distributions allow one to "Gaussianize" the observed data.

This is especially useful / prevalent for financial time series data, which is often skewed and/or heavy-tailed.

Describe the solution you'd like

This exists in the LambertW R package and the pylambertw Python module, which is an sklearn transformer/estimator wrapper around torchlambertw.

Describe alternatives you've considered

Other heavy-tailed distributions; but none of the typical ones allow the ease of interpretation of the heavy-tail parameter, the input/output system view of transformation, and a bijective back-transformation.

Additional context

see here for a detailed discussion with references / screenshots etc.
StatMixedML/XGBoostLSS#55
StatMixedML/XGBoostLSS#65 (comment)

I'd be happy to open a PR to implement a first version of Lambert W x Gaussian distributions, but would like some guidance/pointers on best practices for skpro.
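For readers unfamiliar with the construction, here is a sketch of the heavy-tail variant only: a latent variable u is mapped to z = u exp(delta u^2 / 2), and the back-transformation uses the principal branch of the Lambert W function, u = sign(z) sqrt(W(delta z^2) / delta). The skewed variants use a different transformation. The code is standalone and does not reflect the pylambertw or torchlambertw APIs; all function names are made up here, and Lambert W is computed by a plain Newton iteration.

```python
import math


def lambert_w0(x, tol=1e-12, max_iter=100):
    """Principal branch of Lambert W for x >= 0, solving w * exp(w) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting guess
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))  # Newton step
        w -= step
        if abs(step) < tol:
            break
    return w


def heavy_tail(u, delta):
    """Forward transform: adds heavier tails as delta grows."""
    return u * math.exp(0.5 * delta * u * u)


def gaussianize(z, delta):
    """Back-transform: recovers u from z for delta >= 0."""
    if delta < 0:
        raise ValueError("delta must be non-negative")
    if delta == 0:
        return z
    return math.copysign(math.sqrt(lambert_w0(delta * z * z) / delta), z)


recovered = [gaussianize(heavy_tail(u, 0.2), 0.2) for u in (-2.0, -0.5, 0.0, 1.3)]
```

The round trip shows the bijectivity mentioned above: for delta >= 0 every z maps back to exactly one latent value.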

[ENH] refactor - move `_check_soft_dependencies` to `skbase`

Currently, _check_soft_dependencies is a copy of the one in sktime, which is not DRY.

Both instances of _check_soft_dependencies should move to skbase, and be imported from there.

After removal in the original packages, the skbase lower bound should be raised to guarantee presence of the import.
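For context, the core of such a check is small. The following is a simplified sketch of the idea and not the sktime or skbase implementation: it takes top-level import names (not PyPI distribution names), ignores version specifiers, and the function name and `severity` values are illustrative assumptions.

```python
import importlib.util


def check_soft_dependencies_sketch(*packages, severity="error"):
    """Return True if all top-level import names are importable.

    severity="error" raises ModuleNotFoundError on missing packages,
    any other value returns False instead.
    """
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if not missing:
        return True
    if severity == "error":
        raise ModuleNotFoundError(
            f"missing soft dependencies: {', '.join(missing)}; "
            "install them to use this estimator"
        )
    return False


has_stdlib = check_soft_dependencies_sketch("math", "json")
has_fake = check_soft_dependencies_sketch("definitely_not_a_real_package_xyz", severity="none")
```

Keeping one copy of this logic in skbase means the estimator-level dependency management in sktime and skpro cannot drift apart.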

[ENH] track outcome of probabilistic prediction interface discussion on `sklearn`

We should track the outcome of the probabilistic prediction interface discussion on sklearn:
scikit-learn/scikit-learn#23334

and potentially take action depending on the outcome.

documentation: np.mean(y_pred) does not work

I'm following along with this intro example. However, this line fails:

(numpy.mean(y_pred) * 2).shape

Error below (seems to be because Distribution objects don't support the mean() function but instead insist on obscurely calling it point!)

np.mean(y_pred)
Traceback (most recent call last):

  File "<ipython-input-38-19819be87ab5>", line 1, in <module>
    np.mean(y_pred)

  File "/home/kpmurphy/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2920, in mean
    out=out, **kwargs)

  File "/home/kpmurphy/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py", line 75, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)

TypeError: unsupported operand type(s) for +: 'Distribution' and 'Distribution'

problem in loading the skpro

It has been 2 days that I am trying to import skpro. But I can not I keep getting this error:

cannot import name 'six' from 'sklearn.externals' (C:\Users\My Book\anaconda3\lib\site-packages\sklearn\externals\__init__.py)

skpro-refactoring (version-2)

See below some comments/description of the coming refactoring contents:

Distribution classes refactoring in a more OOD way (see skpro->distribution)
Loss functions (see metrics->distribution)
Estimators (see metrics->distribution)

Some descriptive notebooks (in docs->notebooks) and a full set of unit tests (in tests) are also available.

[BUG] The `cdf` for the `Empirical` distribution returns a dataframe of objects

Describe the bug

The cdf function for the Empirical distribution returns a dataframe of python objects. It should return a dataframe of floats or integers.

To Reproduce

>>> import pandas as pd
>>> import numpy as np
>>> from skpro.distributions.empirical import Empirical
>>> spl_idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["sample", "time"])
>>> spl = pd.DataFrame([[0, 1], [2, 3], [10, 11], [6, 7], [8, 9], [4, 5]], index=spl_idx, columns=["a", "b"])
>>> d = Empirical(spl)
>>> x = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=pd.Index([0, 1, 2]), columns=pd.Index(['a', 'b']))
>>> d.cdf(x)
     a    b
0  0.5  0.5
1  0.5  0.5
2  0.5  0.5
>>> d.cdf(x).values.dtype
dtype('O')
>>> d.cdf(x).dtypes
a    Float64
b    Float64

Expected behavior

The dataframe returned from cdf should contain floats or integers. I don't think Float64 is a valid pandas datatype - it should be float64. Here is an example creating a simple float dataframe.

>>> y = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
>>> y.dtypes
0    float64
1    float64
dtype: object
>>> y.values.dtype
dtype('float64')

Environment

Pip freeze.

anyio==3.7.1
backoff==2.2.1
certifi==2023.7.22
cfgv==3.4.0
coverage==7.3.0
distlib==0.3.7
exceptiongroup==1.1.3
execnet==2.0.2
filelock==3.12.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
identify==2.5.27
idna==3.4
iniconfig==2.0.0
joblib==1.3.2
nodeenv==1.8.0
numpy==1.24.4
packaging==23.1
pandas==2.0.3
platformdirs==3.10.0
pluggy==1.3.0
pre-commit==3.3.3
pytest==7.4.0
pytest-cov==4.1.0
pytest-randomly==3.15.0
pytest-timeout==2.1.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
scikit-base==0.5.1
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
skpro @ file:///home/alex/documents/skpro
sniffio==1.3.0
threadpoolctl==3.2.0
tomli==2.0.1
tzdata==2023.3
virtualenv==20.24.3

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Additional context

This problem arose when I was trying to write a test to check if cdf is the inverse of ppf.

[BUG] location/scale not used in `TDistribution`

Location and scale are not used in TDistribution, incorrectly.

Mirrors sktime/sktime#5918 and sktime/sktime#5919 as reported, and the fix sktime/sktime#5942 by @ivarzap.

Since skpro's distribution module is currently a 1:1 copy - due to pre-refactor state - we should also make the same changes to skpro.

If you would be so kind, @ivarzap, that would be nice - otherwise someone else will copy (and credit you)
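
A possible sanity check for a fix, sketched against `scipy.stats.t` as the reference (using scipy as the reference is an assumption here, and the parameter values are arbitrary): a location/scale t-distribution should behave like `mu + sigma * T` for a standard t-variable `T`.

```python
import numpy as np
from scipy import stats

df, mu, sigma = 5.0, 2.0, 3.0  # arbitrary illustrative values

standard = stats.t(df)
shifted = stats.t(df, loc=mu, scale=sigma)

x = np.array([-1.0, 2.0, 6.5])

# pdf of mu + sigma * T is the standard pdf at (x - mu) / sigma, divided by sigma
pdf_ok = np.allclose(shifted.pdf(x), standard.pdf((x - mu) / sigma) / sigma)
# cdf of mu + sigma * T is the standard cdf at (x - mu) / sigma
cdf_ok = np.allclose(shifted.cdf(x), standard.cdf((x - mu) / sigma))
# mean is mu (for df > 1), variance is sigma^2 * df / (df - 2) (for df > 2)
mean_ok = np.isclose(shifted.mean(), mu)
var_ok = np.isclose(shifted.var(), sigma**2 * df / (df - 2))

print(pdf_ok, cdf_ok, mean_ok, var_ok)
```

An implementation that ignores location and scale would fail all four checks for any `mu != 0` or `sigma != 1`.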

Enhance metrics module through properscoring package?

We should consider using properscoring as it seems to provide quite efficient implementations of loss metrics that could enhance our metrics module

[DOC] replace Boston housing dataset in examples

Note to self that we should replace the Boston housing dataset used in the old legacy examples.

This is due to the racist assumptions in generation of the "squared difference of fraction of Blacks from value pessimum" variable which is also discussed here at its original source in sklearn:
https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html

An alternative dataset could be California housing or diabetes.
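
As a minimal sketch of the swap, the diabetes data ships with scikit-learn and loads without a download (`fetch_california_housing` fetches the data on first use, so it is left out of this snippet):

```python
from sklearn.datasets import load_diabetes

# Returns a feature DataFrame and a target Series when as_frame=True.
X, y = load_diabetes(return_X_y=True, as_frame=True)
print(X.shape, y.shape)
```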

[ENH] grid and random tuning

The package should include a grid and random tuning utility that can make use of predict_proba and probabilistic metrics already refactored.

This should hopefully be a minor adaptation exercise using the abstractions in sktime.

[ENH] ensure sortedness consistency in indices and test

We need to ensure consistency in various indices occurring:

return of predict, predict_proba indices
output of distribution methods' indices
input to distribution constructors, especially when nested

with respect to being sorted.

There is the question about the desired end state - I think indices should not automatically get sorted, and be consistent especially when unsorted.

This is due to sorting being a non-trivial operation that can bump prediction cost from O(n_test) to O(n_test log(n_test)) scaling (or worse in case of nestings).
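
A test for the "consistent when unsorted" end state could look roughly like the sketch below. `predict_stub` is a hypothetical stand-in for an estimator's `predict`, not skpro code; the point is only the two index checks.

```python
import pandas as pd

# Deliberately unsorted test index.
X_test = pd.DataFrame({"f": [1.0, 2.0, 3.0]}, index=[2, 0, 1])


def predict_stub(X):
    """Hypothetical stand-in for predict: one output row per input row."""
    return pd.DataFrame({"y": 2.0 * X["f"].to_numpy()}, index=X.index)


y_pred = predict_stub(X_test)

# Index.equals compares elements in order, so this also checks row order.
same_order = y_pred.index.equals(X_test.index)
# The output must not have been silently sorted.
still_unsorted = not y_pred.index.is_monotonic_increasing

print(same_order, still_unsorted)
```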

[ENH] roadmap of probability distributions to implement

It would be great to have a basic set of probability distributions implemented.

Umbrella issue for implementing sktime probability distributions.

Recipe: use the extension_templates/distribution.py extension template.
Examples:

Normal, for de-novo implementations or manual interfaces
Fisk, for interfacing scipy distributions - this is much easier than using the full template

High priority:

laplace - #19
empirical incl delta - #25
multivariate normal

mid priority:

t-distribution - #49
Cauchy (special case of t) - #49
mixture composition - #26
truncation, compositor

low priority:

lower priority:

list of many more (lowest priority)
https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions - can be interfaced via _ScipyDist adapter easily!
https://en.wikipedia.org/wiki/File:ProbOnto2.5.jpg

Mirrors sktime/sktime#4518
(for high and mid priority)

Contributions can be made to either repository, and should be copied over to the other once approved/merged, until the modules are merged into one.

[BUG] readthedocs failures

Since recently, readthedocs builds fail with a json problem.

There was indeed a , missing in switcher.json, but that did not affect the 2.0.1 release, so I'm not sure what is going on here.

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/events.py", line 97, in emit
    results.append(listener.handler(self.app, *args))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/pydata_sphinx_theme/__init__.py", line 99, in update_config
    switcher_content = json.loads(content)
                       ^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/.asdf/installs/python/3.11.4/lib/python3.11/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 12 column 3 (char 226)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/cmd/build.py", line 293, in build_main
    app = Sphinx(args.sourcedir, args.confdir, args.outputdir,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/application.py", line 272, in __init__
    self._init_builder()
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/application.py", line 343, in _init_builder
    self.events.emit('builder-inited')
  File "/home/docs/checkouts/readthedocs.org/user_builds/skpro/envs/121/lib/python3.11/site-packages/sphinx/events.py", line 108, in emit
    raise ExtensionError(__("Handler %r for event %r threw an exception") %
sphinx.errors.ExtensionError: Handler <function update_config at 0x7f511f19b600> for event 'builder-inited' threw an exception (exception: Expecting ',' delimiter: line 12 column 3 (char 226))

Extension error (pydata_sphinx_theme):
Handler <function update_config at 0x7f511f19b600> for event 'builder-inited' threw an exception (exception: Expecting ',' delimiter: line 12 column 3 (char 226))
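
To locate this kind of problem before the docs build, the JSON file can be checked directly with the standard library. The sketch below uses made-up switcher-like content with a missing comma; the helper name and the sample strings are illustrative.

```python
import json


def find_json_error(text):
    """Return a short error description, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg}: line {err.lineno} column {err.colno}"
    return None


# Second entry is not separated from the first by a comma.
broken = '[\n  {"name": "a", "version": "1"}\n  {"name": "b", "version": "2"}\n]'
fixed = '[\n  {"name": "a", "version": "1"},\n  {"name": "b", "version": "2"}\n]'

print(find_json_error(broken))
print(find_json_error(fixed))
```

The same helper can be pointed at the real file by reading it with `pathlib.Path(...).read_text()`.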

[ENH] energy for multivariate probability distributions with arbitrary k-norm

Currently, the energy method, if called on multivariate distributions (multiple columns), implements the 1-norm energy which is not strictly proper - as it is just the sum of marginal energies, and therefore is minimized by any multivariate distribution that has correct marginals (non-uniquely!). This is due to the default handling of multivariate which is column averaging or summation.

k-norms with k>1 are strictly proper afaik, but they do not fit the current interface which assumes column means/sums.
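
To spell out why the 1-norm case only sees the marginals, assume the energy is read as the expected distance between two independent draws $X, Y$ from the same $d$-column distribution (this reading and the notation are assumptions, not taken from the skpro code):

```latex
\mathbb{E}\,\lVert X - Y \rVert_1
  = \mathbb{E} \sum_{j=1}^{d} \lvert X_j - Y_j \rvert
  = \sum_{j=1}^{d} \mathbb{E}\,\lvert X_j - Y_j \rvert ,
\qquad\text{whereas}\qquad
\mathbb{E}\,\lVert X - Y \rVert_k
  = \mathbb{E} \Bigl( \sum_{j=1}^{d} \lvert X_j - Y_j \rvert^{k} \Bigr)^{1/k}
  \quad (k > 1).
```

Each summand on the left involves only the $j$-th marginal, so any joint distribution with the same marginals gives the same value. For $k > 1$ the sum sits inside the power, and the expectation no longer splits by column.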

We may have to add a param to the energy function, or even a new method for multivariate energy - not sure what the best is design-wise.

The key issue is that the 1-norm and 2-norm energies often have closed form solutions or at least known ones that are efficient to compute, whereas other k may or may not have these.

For the extender contract and tag inspection, it means we must be able to cope with the situation where we may want to implement efficient special cases and leave the other cases to approximate routines.

Any good ideas? (@Alex-JG3, @frthjf)

[BUG] differential/incremental testing of estimators does not seem to work

differential/incremental testing of estimators does not seem to work - still, all estimators are being tested.

This is odd, as I am quite sure that it did work today for a while.

I'm not sure what has changed here so it no longer works.

[ENH] adapter for `scipy` distributions

Adapting scipy distributions is very formulaic and could easily be dealt with by an adapter class. This also avoids duplication in implementation efforts, as scipy is a core dependency.

This may require some abstraction around methods, but it seems mostly like a delegator class, as scipy does its own broadcasting.

Further discrepancies to be mindful of:

energy and similar integrals are not implemented in scipy
the class parameterization is different, scipy uses class methods whereas skpro uses __init__ based parameterization

Good first issue with a design/architecture flavour, can be leveraged to cover a lot of ground in #22.
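
A rough sketch of what such a delegator could look like is below. `ScipyDelegator` is a hypothetical name and design, not skpro's adapter; it only shows `__init__`-based parameters being frozen into a scipy distribution, with DataFrame in and DataFrame out.

```python
import numpy as np
import pandas as pd
from scipy import stats


class ScipyDelegator:
    """Hypothetical adapter: __init__-style parameters, methods delegated to scipy."""

    def __init__(self, scipy_dist, **params):
        self.scipy_dist = scipy_dist
        self.params = params
        # Freezing turns scipy's call-time parameters into object state.
        self._frozen = scipy_dist(**params)

    def _delegate(self, method, x):
        # scipy broadcasts over the array; we only re-attach index and columns.
        values = getattr(self._frozen, method)(x.to_numpy(dtype=float))
        return pd.DataFrame(values, index=x.index, columns=x.columns)

    def pdf(self, x):
        return self._delegate("pdf", x)

    def cdf(self, x):
        return self._delegate("cdf", x)

    def ppf(self, p):
        return self._delegate("ppf", p)


d = ScipyDelegator(stats.fisk, c=3.0, scale=2.0)
x = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=["a", "b"])
roundtrip = d.ppf(d.cdf(x))
print(np.allclose(roundtrip, x))
```

Methods that scipy does not provide, such as energy, would still need their own implementation or an approximation on top of this.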

[ENH] conda release

skpro is not yet available on conda, but should be.

Before we move some proba functionality from sktime into skpro, it must be available on conda.

[ENH] transformed distribution

The package should contain a transformed distribution, after applying a fitted sklearn transformer's transform or inverse_transform.

As these can be arbitrary, it will probably be necessary to estimate most distribution methods by Monte Carlo.

This will be useful in a probabilistic version of TransformedTargetRegressor.
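
The Monte Carlo idea can be sketched as follows, assuming a standard normal base distribution on the transformed scale and a fitted `StandardScaler` as the transformer (both are illustrative choices, and no skpro class is involved):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Fit the transformer on some target values (illustrative data).
y_train = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(y_train)

# Assumed base distribution on the transformed scale: standard normal.
z = rng.normal(size=(100_000, 1))

# Push the samples back to the original scale.
y_samples = scaler.inverse_transform(z)

# Estimate distribution summaries from the transformed sample.
mc_mean = y_samples.mean()
mc_q90 = np.quantile(y_samples, 0.9)
print(mc_mean, mc_q90)
```

For an affine transformer like this one the result could be computed in closed form; the sampling route is what generalizes to arbitrary transformers.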

[BUG] CDF for empirical distribution returns dataframe of objects not floats.

Describe the bug

The cdf function for the Empirical distribution returns dataframes with cells that are of data type object.

To Reproduce

This example is taken from the get_test_params for the Empirical distribution.

>>> from skpro.distributions.empirical import Empirical
>>> import pandas as pd
>>> spl_idx = pd.MultiIndex.from_product(
...     [[0, 1], [0, 1, 2]], names=["sample", "time"]
... )
>>> spl = pd.DataFrame(
...     [[0, 1], [2, 3], [10, 11], [6, 7], [8, 9], [4, 5]],
...     index=spl_idx,
...     columns=["a", "b"],
... )
>>> spl.values.dtype
dtype('int64')
>>> x = pd.DataFrame(
...     [[0, 1], [2, 3], [10, 11]],
...     index=pd.Index([0, 1, 2]),
...     columns=pd.Index(['a', 'b'])
... )
>>> d = Empirical(spl)
>>> d.cdf(x)
     a    b
0  0.5  0.5
1  0.5  0.5
2  1.0  1.0
>>> d.cdf(x).values.dtype
dtype('O')

This problem arose when I was trying to write a test for checking if the ppf is the inverse of the cdf. Here is an example of what this test might look like and the error I get.

>>> import numpy as np
>>> x
    a   b
0   0   1
1   2   3
2  10  11
>>> d
Empirical(columns=Index(['a', 'b'], dtype='object'),
          index=Index([0, 1, 2], dtype='int64', name='time'),
          spl=              a   b
sample time
0      0      0   1
       1      2   3
       2     10  11
1      0      6   7
       1      8   9
       2      4   5)
>>> cdf = d.cdf(x)
>>> cdf.values.dtype
dtype('O')
>>> x_approx = d.ppf(cdf)
>>> x.values.dtype
dtype('int64')
>>> x_approx.values.dtype
dtype('O')
>>> np.allclose(x_approx, x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 200, in allclose
  File "/home/alex/documents/skpro/.venv/lib/python3.10/site-packages/numpy/core/numeric.py", line 2270, in allclose
    res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
  File "<__array_function__ internals>", line 200, in isclose
  File "/home/alex/documents/skpro/.venv/lib/python3.10/site-packages/numpy/core/numeric.py", line 2377, in isclose
    xfin = isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The error occurs because numpy cannot safely coerce the object dtype to a numeric type for the comparison.
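
The failure and a test-side workaround can be reproduced without skpro. In the sketch below, `x_obj` is a hand-made object-dtype frame standing in for the `ppf` output; the explicit float conversion is a workaround for the test, not a fix for the underlying dtype issue.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[0, 1], [2, 3], [10, 11]], columns=["a", "b"])
x_obj = x.astype(object)  # stand-in for the object-dtype ppf output

# isfinite has no implementation for object arrays, hence the TypeError.
try:
    np.isfinite(x_obj.to_numpy())
    raised = False
except TypeError:
    raised = True

# Converting both sides to float explicitly makes the comparison well defined.
close = np.allclose(x_obj.to_numpy(dtype=float), x.to_numpy(dtype=float))

print(raised, close)
```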

Expected behavior

The dataframe returned from cdf should contain floats, not objects.

Environment

pip freeze.

anyio==3.7.1
backoff==2.2.1
certifi==2023.7.22
cfgv==3.4.0
coverage==7.3.0
distlib==0.3.7
exceptiongroup==1.1.3
execnet==2.0.2
filelock==3.12.2
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
identify==2.5.27
idna==3.4
iniconfig==2.0.0
joblib==1.3.2
nodeenv==1.8.0
numpy==1.24.4
packaging==23.1
pandas==2.0.3
platformdirs==3.10.0
pluggy==1.3.0
pre-commit==3.3.3
pytest==7.4.0
pytest-cov==4.1.0
pytest-randomly==3.15.0
pytest-timeout==2.1.0
pytest-xdist==3.3.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
scikit-base==0.5.1
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
skpro @ file:///home/alex/documents/skpro
sniffio==1.3.0
threadpoolctl==3.2.0
tomli==2.0.1
tzdata==2023.3
virtualenv==20.24.3

OS:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Additional context

[ENH] set up allcontributors workflow

We should set up the allcontributors workflow that creates the contributors.md in the root dir.

[BUG] sphinx build failures with json parse error

The sphinx doc build seems to fail with json parse errors.

I don't think anything was changed that would cause this, it's a bit mysterious.

My suspicion is, based on it being the only json file in the docs afaik, the file docs/source/_static/switcher.json, but all seems ok there?

FYI @yarnabrina, @duydl, in case you have any quick spots.

[BUG] cyclic boosting - sporadic test failures due to convergence failure

The recently added cyclic boosting estimator sporadically fails tests due to failed convergence of the loss, e.g.,:

FAILED skpro/regression/tests/test_cyclic_boosting.py::test_cyclic_boosting_with_manual_paramaters - cyclic_boosting.utils.ConvergenceError: Your cyclic boosting training seems to be diverging. In the 9. iteration the current loss: 52.52700124396056, is greater than the trivial loss with just mean predictions: 20.816666666666666.

FYI @setoguchi-naoki, @FelixWick

[BUG] fix logic for multiple scorers in `evaluate` utility

As a lift/shift of the sktime evaluate utility, skpro's utility is equally broken for the case of multiple scorers, see sktime/sktime#5167 for the sktime problem.

This should be fixed by a parallel patch, once sktime has fixed the evaluate utility.

⏳ Installing `skpro`