erdogant / distfit

340.0 7.0 25.0 15.6 MB

distfit is a Python library for probability density fitting.

Home Page: https://erdogant.github.io/distfit

License: Other

Python 27.81% Shell 0.02% Jupyter Notebook 72.17%
probability-distribution fitting-curve hypothesis-testing sse probability-statistics pdf density-functions pypi cumulative-distribution-function kolmogorov-smirnov

distfit's Introduction

Hi there! I am sharing my knowledge with the world through my blogs and open-source GitHub projects.

Your ❤️ is important to keep my packages maintained. It is awesome that there are already millions of downloads, but to keep the libraries alive I often need to make all kinds of fixes, which can eat up my entire weekends and evenings. Yes, I do this for free and in my free time! There are various ways you can help: you can report bugs/issues, or, even better, help out by fixing bugs or adding new features! If you don't have the time, or maybe you are still learning, you can also take a Medium membership using my referral link to keep reading all my hands-on blogs and learn more :-) If you don't need that, there is always the easy way with coffee :-) Cheers!

Buy Me a Coffee at ko-fi.com

A structured list of my repos

All Repos can be found in the Repositories section. If Sphinx pages are available, the link will directly go to the documentation pages.

| Statistics | Machine learning | (Time)Series | Visualization | Utils | API |
|---|---|---|---|---|---|
| bnlearn | clusteval | findpeaks | d3graph | df2onehot | googletrends |
| hnet | classeval | temporalrank | d3heatmap | pypickle | slacki |
| distfit | hgboost | caerus | treeplot | ismember | |
| pca | clustimage | kaplanmeier | | irelease | |
| thompson | undouble | | flameplot | pypiplot | |
| benfordslaw | | | worldmap | dicter | |
| | | | colourmap | | |
| | | | imagesc | | |
| | | | scatterd | | |
| | | | d3blocks | | |

My PyPI download stats can be found here.

Overview of open issues

d3blocks bnlearn hnet distfit pca benfordslaw clustimage undouble clusteval classeval hgboost findpeaks d3graph d3heatmap treeplot kaplanmeier ismember dicter

distfit's People

Contributors

erdogant, grozby, isosphere, killaars, krystofs, ksachdeva, maxmekiska, rwsdatalab


distfit's Issues

A code snippet for `plotting` the best fit and the original data

Thank you for this useful and practical package! Would it be possible to include the feature to plot the best-fit model over the original data? I hope my tentative solution might help other users.

The best-fit model and the actual data can be plotted using the code below.

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
from distfit import distfit

dist = distfit(alpha=0.05, smooth=10)
dist.fit_transform(data)
best_distr = dist.model

fig, ax = plt.subplots()
ax.hist(data, density=True, label="data")
data_range = np.linspace(data.min(), data.max(), 200)
pdf = getattr(scipy.stats, best_distr['name']).pdf(data_range, *best_distr['params'])
ax.plot(data_range, pdf, color='r', lw=4,
        label=f"{best_distr['name']} ({best_distr['stats']}={best_distr['score']:.3f})")
ax.legend()


Integration with `sktime`, `skpro`?

@erdogant, I was wondering whether you would be interested in actively contributing to an integration with sktime and skpro?
https://github.com/sktime/sktime
https://github.com/sktime/skpro

sktime is currently the most widely used sklearn-like framework package for time series. skpro is a similar project around tabular modelling with probability distributions, such as tabular supervised probabilistic regression or conditional density/distribution estimation. It integrates with sktime for things like probabilistic forecasting.

distfit would fit nicely, as its distribution estimation capabilities are broad and provide some required components for things like anomaly detection (tabular and time series) and probabilistic regression. For instance, one could imagine it being used as the probability-estimating component in a probabilistic forecaster.

I was planning a simple interfacing (which you're welcome to review or contribute to), but we could consider closer integration. I'd be happy to contribute, for instance:

  • moving distfit towards more object oriented structure and scikit-learn like interface, similar to skpro.distributions which is using skbase for an sklearn-like interface for distributions. I believe this is also the same that @Roman223 is suggesting in #44
  • collaborating on skpro native distributions, ensuring we sync the large collection of distributions available in distfit, scipy, with an object oriented interface like in skpro. We may have to redesign some aspects of it so it satisfies your requirements for fitting.

What do you think?

I am not sure of the best way to chat, but you are cordially invited to the sktime discord and its channels dedicated to probability modelling: https://discord.com/invite/54ACzaFsn7

Save best parameters

Hello,
Your package is really useful, thanks a lot!

I have a question:
If I want to print the best parameters, what's the syntax?
For example, I want to print the best n and p of a binomial distribution for follow-up work.

thanks a lot

2D distribution fit

This is great work.
What about 2D distribution fitting?
Like a 2D bivariate Gaussian?

Two questions about distfit

This project looks really great, thank you. I have two questions:

  • How do you set loc = 0 if you know that is the right value for it? I am trying to fit to a symmetric distribution.
  • When I try distfit with distr='full' it gets stuck at levy_l. Is this expected?
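On the first question: I am not certain distfit exposes a way to pin loc, but scipy's own fit accepts an `floc` keyword that fixes the location during estimation; a minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=5000)

# floc=0 pins the location parameter during scipy's MLE fit;
# only the remaining parameters (shape, scale) are optimized
shape, loc, scale = stats.gamma.fit(data, floc=0)
```

The same `floc=0` pattern works for any scipy.stats continuous distribution.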

KS-test in fitdist

Hello everyone,

I noticed in the code erdogant/distfit/distfit.py that whenever you use the KS statistical test (stats='ks'), you call scipy.stats.ks_2samp to test your data against the distribution you estimated through MLE (maximum likelihood estimation). Is that true? If so, this is wrong, because the KS statistic then depends on your data and the test is no longer valid. In that case, I would recommend having a look at parametric/non-parametric bootstrapping to solve the issue. This reference could be useful: https://ui.adsabs.harvard.edu/abs/2006ASPC..351..127B/abstract
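For illustration, a parametric-bootstrap KS p-value along the lines the reference suggests might look like this (a sketch, not distfit's implementation):

```python
import numpy as np
from scipy import stats

def ks_parametric_bootstrap(data, dist, n_boot=200, seed=0):
    """Parametric-bootstrap p-value for a KS test whose null parameters
    were estimated from the same data (the naive KS p-value is invalid then)."""
    rng = np.random.default_rng(seed)
    params = dist.fit(data)
    d_obs = stats.kstest(data, dist.name, args=params).statistic
    exceed = 0
    for _ in range(n_boot):
        # simulate from the fitted null, then RE-FIT on every simulated sample
        sim = dist.rvs(*params, size=len(data), random_state=rng)
        sim_params = dist.fit(sim)
        d_sim = stats.kstest(sim, dist.name, args=sim_params).statistic
        exceed += d_sim >= d_obs
    return exceed / n_boot

data = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=300)
p = ks_parametric_bootstrap(data, stats.norm)
```

Re-fitting on each simulated sample is the key step: it reproduces the dependence between the data and the estimated parameters under the null.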

Can I use the best distribution as the true distribution of my data?

Here I used distfit to get the distribution that is closest to my data, but it is not an exact match. When I use kstest from the scipy library to calculate the p-value and see whether I can trust the distribution, the p-value is not ideal. Can I still use the distribution distfit found to describe my data?

Discrete data fit

Thanks a lot for this useful tool. But if my data is not continuous but discrete (e.g. integers), what should I do?

plot_summary throws error

This is a great library, find it quite useful.

The following is a potential bug that needs to be addressed.
distfit.plot_summary() throws the error: set_ticks() got an unexpected keyword argument 'rotation'

--- stack trace ---

TypeError                                 Traceback (most recent call last)
in
----> 1 empDist.plot_summary()

c:\users\iyerram\appdata\local\programs\python\python37\lib\site-packages\distfit\distfit.py in plot_summary(self, n_top, figsize, ylim, fig, ax, grid, color_y1, color_y2, verbose)
   1100
   1101     # You can specify a rotation for the tick labels in degrees or with keywords.
-> 1102     ax.set_xticks(xcoord, df['name'].values, rotation=45)
   1103
   1104     # Pad margins so that markers don't get clipped by the axes

c:\users\iyerram\appdata\local\programs\python\python37\lib\site-packages\matplotlib\axes\_base.py in wrapper(self, *args, **kwargs)
    61
    62     def wrapper(self, *args, **kwargs):
--> 63         return get_method(self)(*args, **kwargs)
    64
    65     wrapper.__module__ = owner.__module__

c:\users\iyerram\appdata\local\programs\python\python37\lib\site-packages\matplotlib\cbook\deprecation.py in wrapper(*args, **kwargs)
    449             "parameter will become keyword-only %(removal)s.",
    450             name=name, obj_type=f"parameter of {func.__name__}()")
--> 451         return func(*args, **kwargs)
    452
    453     return wrapper

TypeError: set_ticks() got an unexpected keyword argument 'rotation'


new feature request - automatically fit multiple variables

I recommend dfit.fit_transform(X) be extended to include multiple variables. Each variable will be fitted individually.

matrix rows = samples
matrix columns = features (variables)


The proposed functionality mirrors the popular scikit-learn API. Here is an example of that API: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Also, parallel processing across a multi-core CPU would be an awesome enhancement! :-)

Guillaume Lemaitre (https://github.com/glemaitre) committed code for sklearn.utils.parallel. He is a developer for the scikit-learn foundation. He may be a good contact on how best to implement parallel processing in Python in 2023.

Add K distribution

What a really awesome repository!

By the way, the K distribution is widely used in the field of radar and sonar, and it is necessary to estimate its parameters.

Please consider adding this distribution if possible.

Logo has a typo

It says “denstity”, it should say “density”. I’d like to recommend this library to people but the typo may undercut credibility. Thanks for the hard work making this! 🙏

pip install bnlearn?

Great package! I didn't know about this! ⭐
I think you have one typo in the README: pip install bnlearn when it should be pip install distfit :)

Bootstrapping bug - The data contains non-finite values

The bootstrap call ends with a "The data contains non-finite values." exception, while using n_jobs > 10 in the distfit constructor does not trigger it on the same dataset. I've included code for reproduction, complete with test data, in the linked zip file.

repro.zip

(The dataset contains only positive finite floats.)

Typo in docs

Link

dist.distributions is a list containing the extracted pdfs from scripy

scipy -> scripy

Support scientific notation of title

If we fit data with very small values, the loc and scale shown in the plot title are both displayed as zero.


>>> print(dist.model)

{'distr': <scipy.stats._continuous_distns.lognorm_gen object at 0x7f3ed6e8e9a0>, 'stats': 'RSS', 'params': (0.16244674220470803, -2.9009343867951488e-05, 3.154691365893712e-05), 'name': 'lognorm', 'model': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3ec551f490>, 'score': 2201474051.037151, 'loc': -2.9009343867951488e-05, 'scale': 3.154691365893712e-05, 'arg': (0.16244674220470803,), 'CII_min_alpha': -4.859578914018569e-06, 'CII_max_alpha': 1.2200488331361844e-05}

The `distr` parameter should accept a list

The distr parameter in your core distfit class should accept a custom list of distributions that the user wants to run fitting on. Is there a specific reason you have not allowed it to accept a list?

Incorporate numba with scipy.special and numba-stats

There may be some potentially significant speed improvements from running compiled code. At first glance, numba-stats does not seem to provide a fit method, which might be a significant hurdle, and not all distributions are available that way, so this is more of a placeholder or aspirational enhancement.

New methods to assess the goodness-of-fit

Greetings,

First of all, thank your amazing library!

I wonder, are there any plans to add new methods for assessing the goodness-of-fit, such as those you mention in the docs:

"""
such as the maximum likelihood estimation (mle), moment matching estimation (mme), quantile matching estimation (qme) or maximizing goodness-of-fit estimation (mge)
"""

Regards

Bug(?): lognorm distribution with negative loc parameter

Thank you for this extremely helpful package, which I found via a Medium post and a colleague's recommendation.
Since discovering distfit I have been eager to try the parametric approach to fitting PDFs.
Below you can find an example for mockup data.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from distfit import distfit

np.random.seed(4)
x_sim = np.random.normal(loc=47.55, scale=13.8, size=10000)
x_sim = np.append([*filter(lambda v: v <= 80, x_sim)], np.random.normal(loc=90, scale=10, size=50))
x_sim = np.array([*filter(lambda v: v >= 0, x_sim)])

x = x_sim

dfit = distfit('parametric', todf=True, distr=["lognorm"])
dfit.fit_transform(x)
dfit.bootstrap(x, n_boots=100)

fig, ax = plt.subplots(1, 3, figsize=(20, 8))
sns.histplot(x, ax=ax[0])
dfit.plot("PDF", n_top=3, fontsize=11, ax=ax[1])
dfit.plot("CDF", n_top=3, fontsize=11, ax=ax[2])
plt.show()


I was kind of surprised by the negative location parameter (of $-822.$) for the lognorm distribution. Perhaps I misunderstand what the loc parameter means here?
Also, I was not quite able to reproduce the plots I obtained from my actual data with the mock data.
For the true data I often got PDFs of the form below (despite the histogram sometimes following a nearly perfect bell shape). Unfortunately I cannot provide the data.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.vlines(3000, ymax=0.05, ymin=0, color="red", linestyle="--")
ax.vlines(0, ymax=0.05, ymin=0, color="red", linestyle="--")
ax.set_ylim((0, 0.05))
ax.vlines(5, ymin=0, ymax=0.008, linewidth=3, color="black")
ax.hlines(xmin=5, xmax=350, y=0, linewidth=7, color="black")


Minor Points

  • In the legend of the PDF plot, the best-fitting distribution name is capitalized. For consistency (e.g., with plot_summary() or with dfit.plot("CDF")), it might be advisable to keep it lowercase.

Add loggamma

I have a problem where loggamma fits best. I ran your script and my own custom script; they agree on the beta parameters, but loggamma seemed much more natural. If it's not too much trouble, please consider adding it. If you are using scipy.stats, it has the same API as the others.

Cool project.
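To illustrate the "same API" point: loggamma in scipy.stats follows the standard rvs/fit interface (the values below are purely illustrative; floc/fscale pin location and scale so the shape parameter c stays identifiable):

```python
import numpy as np
from scipy import stats

# Sample from loggamma with shape c=2, then recover c by MLE.
# Fixing floc=0 and fscale=1 removes the location/scale degrees of freedom.
data = stats.loggamma.rvs(c=2.0, size=2000, random_state=0)
c, loc, scale = stats.loggamma.fit(data, floc=0, fscale=1)
```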

Library decomposition

Despite the good structure within the single .py file, keeping everything in one file is not good practice.

I faced an issue of my own and compared my code against the library to figure it out (the problem was in my code). During this process I had difficulty navigating the code; I had to use text search instead of references in PyCharm.

My suggestion is to adopt an OOP architecture. It seems it would not be too complex, since the code already follows a functional pattern (there are even blueprints for protected methods!).

If the maintainer agrees with this, I'd like to work on it.

Fitting distribution for discrete/categorical data

Hi

Is it possible to fit a distribution to a discrete variable with the distfit library? For example, let's say I have a survey with 10 questions whose possible values go from 1 (poor) to 5 (excellent), and 100 people take the survey.

Best regards

New User Plot Question

Greetings

I am a new user; I read about distfit here: "https://towardsdatascience.com/how-to-find-the-best-theoretical-distribution-for-your-data-a26e5673b4bd"

I am also reading the information here "https://erdogant.github.io/distfit/pages/html/index.html"

The code runs, but no PDF plot is generated, and I get an odd message (I don't know if it is related): [colourmap]> Warning: Colormap [Set1] can not create [11] unique colors! Available unique colors: [9]

Question:
Is there any special settings for creating the PDF?
How can I set the output location of the PDF?

This is the sample I ran

import numpy as np
from distfit import distfit

# Example data
X = np.random.normal(10, 3, 2000)
y = [3, 4, 5, 6, 10, 11, 12, 18, 20]

# Initialize
dfit = distfit(todf=True)

# Search for the best theoretical fit on your empirical data
results = dfit.fit_transform(X)
dfit.plot()

Regards

plot freeze

On Windows 10, Python 3.10, PyCharm Build #PC-231.9161.41 (built on June 20, 2023).

After:

import numpy as np
from distfit import distfit

X = np.random.normal(0, 2, 10000)
y = [-8, -6, 0, 1, 2, 3, 4, 5, 6]

# Initialize
dfit = distfit(alpha=0.01)

# Fit
dfit.fit_transform(X)

# Plot separately
fig, ax = dfit.plot(chart='pdf')

I get a blank, frozen plot! I can't resize it or interact with it. If I close the plot I get:
Process finished with exit code -805306369 (0xCFFFFFFF)

Help needed to interpret loc and scale

Hi

Thank you for the great library.

I have been using it to fit my data and obtained the following great result. However, while I fully understand a=12.8672 and b=12.86672, I am having difficulty interpreting what loc and scale are. From the documentation, they seem to be the mean and std-dev.

I know the distribution mean is at 25 and std-dev=5.4645.

Are you able to show me how to convert the loc and scale to get mean=25 and std-dev=5.4645?

Many thanks in advance.
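A sketch of the conversion with scipy: a frozen distribution exposes mean() and std() directly, so loc and scale never need to be converted by hand. The a, b values are from the screenshot above; loc and scale here are hypothetical stand-ins (in practice, use the fitted values from dfit.model):

```python
from scipy import stats

a, b = 12.8672, 12.8672   # shape parameters from the screenshot
loc, scale = 0.0, 50.0    # hypothetical stand-ins for the fitted loc and scale

# Freezing bakes in all four parameters, so mean/std come out in data units:
# loc shifts and scale stretches the standard beta support [0, 1]
frozen = stats.beta(a, b, loc=loc, scale=scale)
print(frozen.mean())  # 25.0 for these illustrative values
print(frozen.std())
```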

Sum-plots

I am confused about dist.results.

The output of dist.results has four columns; I want to know the difference between the "y_pred" and "P" columns. My understanding is that "P" is the p-value and "y_pred" is the FDR-corrected p-value. Please tell me whether my understanding is right.

Robustness of selected data models

Good day!

Guys, I have found your package really cool. Thanks a lot!

I have a question:

Our incoming data can contain anomalies and noise, so the quality of our results is vulnerable to strong/weak outliers. Working with outliers is a key feature of your package. Consequently, the quality of predictions based on our data model can be severely compromised; in a sense, we are training and predicting on the same data.

What is your advice?

I understand that it largely depends on the nature of the particular theoretical distribution of the data.

But it would be good to know your personal opinion as the authors...

Support Gaussian Mixture distribution fitting (2GMM, 3GMM..)

I really like this package; thanks for the implementation and support @erdogant.

Are there any plans to support Gaussian mixture distributions as well?
I think it would be really helpful for me and other people to have them in distfit's list of distributions.

I think there is an implementation of GMMs in sklearn; the question is whether it can be integrated into distfit...

Example, reference: https://stats.stackexchange.com/questions/499130/fitting-mixture-model-of-gaussians-and-uniform-distributions-to-real-data


New `anomaly` method?

Is it possible to create a utility method `anomaly` with the existing functions you already have? This is for anomaly detection in 1-D time series or sensor data streams, for which your package is ideally suited. Basically, you can use the predict method and put a small wrapper around it like this:

def anomaly(dfit, arr, prob_thresh=0.001):
    """
    Predict (binary) whether the points in `arr` are anomalies, based on a
    probability threshold. If a point has a high probability of coming from
    the fitted distribution, return False (not an anomaly), else True.
    """
    preds = dfit.predict(arr)
    return preds['y_proba'] < prob_thresh

extend __init__.py descriptions

I am not sure if this qualifies as an issue, but the distfit help in IPython (e.g. "distfit??") returns a minimal set of information.
It would be nice (and helpful) to include, e.g., a list of all the methods and some more extended documentation.
I would be glad to help, e.g. if the raw docs are available somewhere and need to be formatted and included in the file.
Thanks!

`generate` or `rvs` method?

Do you plan to have a generate or rvs method added to a fitted dist class to generate a given number (chosen by a size parameter) of new points with the best-fitted distribution? Here is the imagined code (say I have a dataset called dataset)

dist = distfit(todf=True)
dist.fit_transform(dataset)

# Newly generated 1000 points from the best-fitted distribution (based on some score criteria)
new_data = dist.generate(size=1000)

Error from scipy.optimize within fit_transform

I think this error comes from the comparison of different fits in fit_transform():

ValueError: The function value at x=9.999999999999999e+306 is NaN; solver cannot continue.

See the example below.

import numpy as np
from distfit import distfit

data = np.array([ 56.518556,  54.803739,  57.424846,  54.254221,  63.235301,
        55.815964,  56.557519,  56.789227,  55.710028,  55.348868,
        55.998148,  54.88984 ,  60.698556,  58.037249,  55.998148,
        56.196659,  54.07792 ,  54.324279,  55.279121,  55.85325 ,
        54.677967,  54.330469,  54.122291,  54.819475,  54.565095,
       117.236973,  54.512653,  56.638532,  53.162648,  54.602637,
        56.66363 ,  56.934138,  59.085959,  57.303842,  58.934084,
       183.203797, 110.220432,  57.52065 ,  54.509817, 129.639834,
        69.668429, 126.631612, 148.791635,  85.291877, 145.450409,
        58.601726,  83.397137,  62.084062,  54.81671 ,  59.890595,
        95.307584,  63.366694,  54.16292 , 151.382722,  58.215827,
       147.623722, 119.041469, 114.503229,  66.526126, 138.969765,
       135.064926, 146.308008, 139.331183, 125.15503 , 150.57275 ,
        59.308423,  58.144718,  57.447888, 149.112722, 142.705007,
       133.288753, 141.13678 , 147.519795, 140.110123, 152.173592,
       146.103209, 147.683985, 147.416646, 150.066857, 142.576063,
       144.83517 , 145.818179, 145.499275, 145.83333 , 145.298108,
       146.885261, 145.397002, 145.282957, 145.648398, 145.884035,
       146.648887])

marg_dists = ['gennorm',
              'genlogistic',
              'mielke',
              'johnsonsu',
              'burr',
              'johnsonsb',
              'loggamma',
              'norminvgauss']

distfit(distr=marg_dists).fit_transform(data)

Plots are not generated

Hi,

Both dist.plot() and dist.plot_summary() do not generate plots for me. I am using a bare installation of Python (i.e. no Conda etc.).

Am I missing something?

Regards,

Danish

T Distribution Weirdness

We are using distfit to determine whether some of our data can be modelled parametrically. For some of the data, the best-fitting distribution was a t distribution. Scale and loc are clearly documented, which is great. There is one remaining parameter needed to fit a t distribution: degrees of freedom. Except the one parameter in the distfit output that isn't a scale or loc value is less than one, and we assumed degrees of freedom couldn't be less than one. So what is that parameter, and why isn't degrees of freedom included in the output? It would be helpful for automating our process.
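For context, scipy parameterizes the Student t as (df, loc, scale), so the extra fitted value is the degrees of freedom, and scipy's MLE only requires df > 0; values below one are mathematically valid (tails heavier than Cauchy), if surprising. A quick check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_t(df=5, size=2000)

# scipy.stats.t is parameterized as (df, loc, scale): the first fitted value
# IS the degrees of freedom, and the optimizer only constrains it to df > 0
df, loc, scale = stats.t.fit(data)
```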
