yu9824 / kennard_stone

This is an algorithm for evenly partitioning.

Home Page: https://pypi.org/project/kennard-stone/

License: MIT License

Topics: scikit-learn, train-test-split, kfold-cross-validation, python

kennard_stone's Introduction

Kennard Stone


What is this?

This is an algorithm for evenly partitioning data in a scikit-learn-like interface. (See References for details of the algorithm.)

(Simulation GIF illustrating how the data set is partitioned.)
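For intuition, the selection works roughly as follows: pick the two most distant samples first, then repeatedly add the remaining sample that is farthest from the already-selected set. Below is a simplified NumPy/SciPy sketch of this classic procedure; it is an illustration only, not this package's optimized, parallelized implementation.

# Simplified illustration of the classic Kennard-Stone selection
# (not this package's optimized implementation).
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone_indices(X, n_select):
    dist = cdist(X, X)  # pairwise Euclidean distances
    # Start with the two samples that are farthest apart.
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while remaining and len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample;
        # pick the sample for which this distance is largest.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected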

How to install

PyPI

pip install kennard-stone

The project page on PyPI is https://pypi.org/project/kennard-stone/.

Anaconda

conda install -c conda-forge kennard-stone

The package is also published on the conda-forge channel; see its project page on Anaconda.org.

You need numpy>=1.20 and scikit-learn installed to run it.

How to use

You can use these utilities just like their scikit-learn counterparts.

See examples for details.

In the following, X denotes an arbitrary explanatory (feature) variable and y an arbitrary objective (target) variable; estimator denotes an arbitrary prediction model that conforms to the scikit-learn API.
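For instance, a minimal setup for the snippets below might look like this (scikit-learn's diabetes dataset and Ridge are used purely as examples):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# Any array-like features/target and any scikit-learn-compatible model will do.
X, y = load_diabetes(return_X_y=True)  # NumPy arrays, so X[indices] works below
estimator = Ridge()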

train_test_split

kennard_stone

from kennard_stone import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scikit-learn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=334
)

KFold

kennard_stone

from kennard_stone import KFold

# The folds are always "shuffled" (not in input order) and uniquely determined for a given data set.
kf = KFold(n_splits=5)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]

scikit-learn

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=334)
for i_train, i_test in kf.split(X, y):
    X_train = X[i_train]
    y_train = y[i_train]
    X_test = X[i_test]
    y_test = y[i_test]

Other usages

Wherever you would specify cv in scikit-learn, you can pass a KFold object from this package and use it with various functions.

An example is cross_validate.

kennard_stone

from kennard_stone import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5)
print(cross_validate(estimator, X, y, cv=kf))

scikit-learn

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

kf = KFold(n_splits=5, shuffle=True, random_state=334)
print(cross_validate(estimator, X, y, cv=kf))

OR

from sklearn.model_selection import cross_validate

print(cross_validate(estimator, X, y, cv=5))
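Any other scikit-learn utility that accepts cv works the same way with the kennard_stone KFold; for example, a sketch with cross_val_score:

from kennard_stone import KFold
from sklearn.model_selection import cross_val_score

# Pass the Kennard-Stone KFold wherever scikit-learn expects `cv`.
scores = cross_val_score(estimator, X, y, cv=KFold(n_splits=5))
print(scores)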

Notes

There is no notion of random_state or shuffle because the partitioning is uniquely determined by the dataset. Passing these arguments does not raise an error, but they have no effect on the result, so be careful.
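As a quick sanity check (a sketch, not part of the package's own documentation), two calls on the same data return identical splits:

from kennard_stone import train_test_split

# The split depends only on the data, so repeated calls give the same result.
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)
assert (X_test1 == X_test2).all()  # deterministic; no random_state needed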

If you want to run the notebooks in the examples directory, you will also need to install pandas, matplotlib, seaborn, tqdm, and jupyter in addition to the packages in requirements.txt.

Distance metrics

Valid values for metric are listed below; see the documentation of sklearn.metrics.pairwise_distances and scipy.spatial.distance for details. The default is "euclidean".

  • From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. These metrics support sparse matrix inputs. 'nan_euclidean' is also accepted but does not yet support sparse matrices.
  • From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']. These metrics do not support sparse matrix inputs.
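For example, to use a metric other than the default (a sketch):

from kennard_stone import KFold, train_test_split

# Manhattan (cityblock) distance instead of the default Euclidean distance.
kf = KFold(n_splits=5, metric="manhattan")

# "nan_euclidean" is intended for data containing missing values (dense input only).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, metric="nan_euclidean"
)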

Parallelization (since v2.1.0)

This algorithm is computationally intensive and can take a long time. To address this, the implementation has been parallelized and the algorithm optimized since v2.1.0. n_jobs can be specified for parallelization, as in the scikit-learn API.

# parallelized KFold
kf = KFold(n_splits=5, n_jobs=-1)

# parallelized train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, n_jobs=-1
)

Parallelization is used when calculating the distance matrix, so it does not conflict with running something like cross_validate in parallel while using KFold.

# OK: the two levels of parallelism do not conflict with each other
cross_validate(estimator, X, y, cv=KFold(5, n_jobs=-1), n_jobs=-1)

Using GPU

If you have a GPU and have installed PyTorch, you can use it to calculate Minkowski distances (including the Manhattan, Euclidean, and Chebyshev distances).

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, device="cuda"
)
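A sketch that falls back to the CPU when no GPU is visible; note that only device="cuda" appears above, so passing "cpu" here is an assumption:

import torch
from kennard_stone import train_test_split

# Use the GPU if PyTorch can see one; "cpu" as a fallback value is an assumption.
device = "cuda" if torch.cuda.is_available() else "cpu"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, device=device
)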

LICENSE

MIT License

Copyright (c) 2021 yu9824

References

Papers

  • Kennard, R. W.; Stone, L. A. "Computer Aided Design of Experiments." Technometrics 11(1), 137–148 (1969).

Sites

Histories

v2.0.0 (deprecated)

  • Define the extended Kennard-Stone algorithm (multi-class), i.e. improve the KFold algorithm.
  • Delete the alternate argument in KFold.
  • Drop the pandas requirement.

v2.0.1

  • Fix a bug with Python 3.7.

v2.1.0 (deprecated)

  • Optimize the algorithm.
  • Handle large amounts of data:
    • parallelize the distance calculation (add the n_jobs argument)
    • replace recursive functions with for-loops
  • Add distance metrics other than "euclidean" (add the metric argument).

v2.1.1 (deprecated)

  • Fix bug when metric="nan_euclidean".

v2.1.2 (deprecated)

  • Fix details.
    • Update docstrings and typings.

v2.1.3 (deprecated)

  • Fix details.
    • Update some typings. (A list of the strings that can be used for metric is now available.)

v2.1.4

  • Fix a bug when metric is "seuclidean" or "mahalanobis".
    • Add tests that check all metrics.
  • Add the requirement numpy>=1.20.

v2.1.5

  • Remove the "kulsinski" metric to support scipy>=1.11.

v2.1.6

  • Improve typing in kennard_stone.train_test_split
  • Add some docstrings.

v2.2.0

  • Support GPU calculations (when metric is 'euclidean', 'manhattan', 'chebyshev', or 'minkowski').
  • Support Python 3.12.

v2.2.1

  • Fix setup.cfg
  • Update 'typing'


kennard_stone's Issues

Very slow with a large amount of data

Hello yu9824,
train_test_split is quite slow with thousands of samples.
With 500 samples the split takes only a few seconds, but with 2000 samples it takes 7 minutes.
The split process seems to use only one CPU core.
I look forward to performance optimizations.

metric='nan_euclidean' does not work when the input contains NaN

In version 2.1.0, when the input contains NaN, a ValueError is raised.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:12

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:142, in train_test_split(test_size, train_size, *arrays, **kwargs)
    136 indexes = ks.get_indexes(X)[0]
    138 n_samples = _num_samples(X)
    139 n_train, n_test = _validate_shuffle_split(
    140     n_samples,
    141     self.test_size,
--> 142     self.train_size,
    143     default_test_size=self._default_test_size,
    144 )
    146 for _ in range(self.get_n_splits()):
    147     ind_test = indexes[:n_test]

File ~/mambaforge/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1689, in BaseShuffleSplit.split(self, X, y, groups)
   1659 """Generate indices to split data into training and test set.
   1660 
   1661 Parameters
   (...)
   1686 to an integer.
   1687 """
   1688 X, y, groups = indexable(X, y, groups)
-> 1689 for train, test in self._iter_indices(X, y, groups):
   1690     yield train, test

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:75, in _iter_indices(self, X, y, groups)
     41 def __init__(
     42     self,
     43     n_splits: int = 5,
   (...)
     47     **kwargs,
     48 ) -> None:
     49     """K-Folds cross-validator using the Kennard-Stone algorithm.
     50 
     51     Parameters
     52     ----------
     53     n_splits : int, optional
     54         Number of folds. Must be at least 2., by default 5
     55 
     56     metric : str, optional
     57         The distance metric to use. See the documentation of
     58         `sklearn.metrics.pairwise_distances` for valid values.
     59         , by default "euclidean"
     60 
     61         =============== ========================================
     62         metric          Function
     63         =============== ========================================
     64         'cityblock'     metrics.pairwise.manhattan_distances
     65         'cosine'        metrics.pairwise.cosine_distances
     66         'euclidean'     metrics.pairwise.euclidean_distances
     67         'haversine'     metrics.pairwise.haversine_distances
     68         'l1'            metrics.pairwise.manhattan_distances
     69         'l2'            metrics.pairwise.euclidean_distances
     70         'manhattan'     metrics.pairwise.manhattan_distances
     71         'nan_euclidean' metrics.pairwise.nan_euclidean_distances
     72         =============== ========================================
     73 
     74     n_jobs : int, optional
---> 75         The number of parallel jobs., by default None
     76     """
     77     super().__init__(n_splits=n_splits, shuffle=False, random_state=None)
     78     self.metric = metric

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:167, in get_indexes(self, X)
    152 @overload
    153 def train_test_split(
    154     *arrays,
   (...)
    158     n_jobs: Optional[int] = None,
    159 ) -> list:
    160     pass
    163 def train_test_split(
    164     *arrays,
    165     test_size: Optional[Union[float, int]] = None,
    166     train_size: Optional[Union[float, int]] = None,
--> 167     metric: str = "euclidean",
    168     n_jobs: Optional[int] = None,
    169     **kwargs,
    170 ) -> list:
    171     """Split arrays or matrices into train and test subsets using the
    172     Kennard-Stone algorithm.
    173 
   (...)
    224     ValueError
    225     """
    226     if "shuffle" in kwargs:

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    915         raise ValueError(
    916             "Found array with dim %d. %s expected <= 2."
    917             % (array.ndim, estimator_name)
    918         )
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,
    924             estimator_name=estimator_name,
    925             allow_nan=force_all_finite == "allow-nan",
    926         )
    928 if ensure_min_samples > 0:
    929     n_samples = _num_samples(array)

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    144 if estimator_name and input_name == "X" and has_nan_error:
    145     # Improve the error message on how to handle missing values in
    146     # scikit-learn.
    147     msg_err += (
    148         f"\n{estimator_name} does not accept missing values"
    149         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    159         "#estimators-that-handle-nan-values"
    160     )
--> 161 raise ValueError(msg_err)

ValueError: Input contains NaN.

Make conda installation available as well

First of all, thank you for writing this library. It is very helpful. But please make it available on conda as well. If you need my help to publish it in the conda-forge channel, I will send a PR.

Latest Release breaks compatibility with Python 3.7

The latest changes in v2 are no longer compatible with Python 3.7. Run this example in an environment with Python 3.7 installed to reproduce the error:

>>> from kennard_stone import train_test_split
>>> import numpy as np

>>> train_test_split(np.array([1,2,3]).reshape(-1, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 142, in train_test_split
    train, test = next(cv.split(X=arrays[0]))
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1600, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 75, in _iter_indices
    indexes = ks.get_indexes(X)[0]
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 194, in get_indexes
    X=X, lst_idx_selected=lst_idx_selected, idx_remaining=idx_remaining
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 211, in _sort
    sum(lst_idx_selected, start=[])
TypeError: sum() takes no keyword arguments

This change in the call to sum should be reverted, or Python 3.7 should be dropped from the list of supported Python versions.
