yu9824 / kennard_stone


This is an algorithm for evenly partitioning data.

Home Page: https://pypi.org/project/kennard-stone/

License: MIT License

Python 100.00%

scikit-learn train-test-split kfold-cross-validation python

kennard_stone's Issues

metric='nan_euclidean' does not work with input contains NaN

In version 2.10, train_test_split raises a ValueError when the input contains NaN, even with metric='nan_euclidean'.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:12

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:142, in train_test_split(test_size, train_size, *arrays, **kwargs)
    136 indexes = ks.get_indexes(X)[0]
    138 n_samples = _num_samples(X)
    139 n_train, n_test = _validate_shuffle_split(
    140     n_samples,
    141     self.test_size,
--> 142     self.train_size,
    143     default_test_size=self._default_test_size,
    144 )
    146 for _ in range(self.get_n_splits()):
    147     ind_test = indexes[:n_test]

File ~/mambaforge/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1689, in BaseShuffleSplit.split(self, X, y, groups)
   1659 """Generate indices to split data into training and test set.
   1660 
   1661 Parameters
   (...)
   1686 to an integer.
   1687 """
   1688 X, y, groups = indexable(X, y, groups)
-> 1689 for train, test in self._iter_indices(X, y, groups):
   1690     yield train, test

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:75, in _iter_indices(self, X, y, groups)
     41 def __init__(
     42     self,
     43     n_splits: int = 5,
   (...)
     47     **kwargs,
     48 ) -> None:
     49     """K-Folds cross-validator using the Kennard-Stone algorithm.
     50 
     51     Parameters
     52     ----------
     53     n_splits : int, optional
     54         Number of folds. Must be at least 2., by default 5
     55 
     56     metric : str, optional
     57         The distance metric to use. See the documentation of
     58         `sklearn.metrics.pairwise_distances` for valid values.
     59         , by default "euclidean"
     60 
     61         =============== ========================================
     62         metric          Function
     63         =============== ========================================
     64         'cityblock'     metrics.pairwise.manhattan_distances
     65         'cosine'        metrics.pairwise.cosine_distances
     66         'euclidean'     metrics.pairwise.euclidean_distances
     67         'haversine'     metrics.pairwise.haversine_distances
     68         'l1'            metrics.pairwise.manhattan_distances
     69         'l2'            metrics.pairwise.euclidean_distances
     70         'manhattan'     metrics.pairwise.manhattan_distances
     71         'nan_euclidean' metrics.pairwise.nan_euclidean_distances
     72         =============== ========================================
     73 
     74     n_jobs : int, optional
---> 75         The number of parallel jobs., by default None
     76     """
     77     super().__init__(n_splits=n_splits, shuffle=False, random_state=None)
     78     self.metric = metric

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:167, in get_indexes(self, X)
    152 @overload
    153 def train_test_split(
    154     *arrays,
   (...)
    158     n_jobs: Optional[int] = None,
    159 ) -> list:
    160     pass
    163 def train_test_split(
    164     *arrays,
    165     test_size: Optional[Union[float, int]] = None,
    166     train_size: Optional[Union[float, int]] = None,
--> 167     metric: str = "euclidean",
    168     n_jobs: Optional[int] = None,
    169     **kwargs,
    170 ) -> list:
    171     """Split arrays or matrices into train and test subsets using the
    172     Kennard-Stone algorithm.
    173 
   (...)
    224     ValueError
    225     """
    226     if "shuffle" in kwargs:

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    915         raise ValueError(
    916             "Found array with dim %d. %s expected <= 2."
    917             % (array.ndim, estimator_name)
    918         )
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,
    924             estimator_name=estimator_name,
    925             allow_nan=force_all_finite == "allow-nan",
    926         )
    928 if ensure_min_samples > 0:
    929     n_samples = _num_samples(array)

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    144 if estimator_name and input_name == "X" and has_nan_error:
    145     # Improve the error message on how to handle missing values in
    146     # scikit-learn.
    147     msg_err += (
    148         f"\n{estimator_name} does not accept missing values"
    149         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    159         "#estimators-that-handle-nan-values"
    160     )
--> 161 raise ValueError(msg_err)

ValueError: Input contains NaN.
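
The traceback above ends in scikit-learn's check_array, whose default validation rejects NaN. As a minimal sketch of the mismatch (the array X below is made up for illustration, and this is not the library's own code), scikit-learn's nan_euclidean_distances accepts the same input that the default check_array call refuses:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances
from sklearn.utils import check_array

# Hypothetical input with one missing value.
X = np.array([[0.0, 1.0], [np.nan, 2.0], [3.0, 4.0]])

# The NaN-aware metric skips missing coordinates and rescales the result,
# so every pairwise distance here is finite.
D = nan_euclidean_distances(X)
distances_are_finite = bool(np.isfinite(D).all())

# Default input validation, by contrast, raises on NaN.
try:
    check_array(X)
    default_validation_rejects_nan = False
except ValueError:
    default_validation_rejects_nan = True

print(distances_are_finite, default_validation_rejects_nan)
```

This suggests the metric itself is not the problem; the input is rejected during validation, before any distance is computed.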

Latest Release breaks compatibility with Python 3.7

The latest changes in v2 are no longer compatible with Python 3.7. Run this example in an environment with Python 3.7 installed to reproduce the error:

>>> from kennard_stone import train_test_split
>>> import numpy as np

>>> train_test_split(np.array([1,2,3]).reshape(-1, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 142, in train_test_split
    train, test = next(cv.split(X=arrays[0]))
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1600, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 75, in _iter_indices
    indexes = ks.get_indexes(X)[0]
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 194, in get_indexes
    X=X, lst_idx_selected=lst_idx_selected, idx_remaining=idx_remaining
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 211, in _sort
    sum(lst_idx_selected, start=[])
TypeError: sum() takes no keyword arguments

This change in the call to sum should be reverted, or Python 3.7 should be dropped from the list of supported Python versions.
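
For reference, sum() only accepts start as a keyword argument from Python 3.8 onward. A small sketch of two spellings that also run on Python 3.7 (the sample list is invented; only the name lst_idx_selected comes from the traceback):

```python
from itertools import chain

# Hypothetical nested index list, shaped like the argument in the traceback.
lst_idx_selected = [[0, 3], [1], [2]]

# Passing the start value positionally works on Python 3.7 and later.
flat_positional = sum(lst_idx_selected, [])

# itertools.chain avoids the repeated list copies that sum() makes.
flat_chain = list(chain.from_iterable(lst_idx_selected))

print(flat_positional, flat_chain)
```

Either form flattens the list without the keyword argument; the chain version also scales better for long lists.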

Very slow with large amount of data

Hello yu9824,
train_test_split is quite slow with thousands of samples.
With 500 samples the split takes only a few seconds, but with 2000 samples it takes 7 minutes.
The split appears to use only one CPU core.
I look forward to performance optimizations.
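
One possible direction, sketched under assumptions and not taken from the library: compute the distance matrix once and keep a running "distance to the nearest selected sample" vector, so each selection step costs O(n) instead of rescanning all pairs. The function name kennard_stone_order is hypothetical, the metric is fixed to Euclidean, and the start rule is the classic one (the two mutually farthest samples), which may differ from what kennard_stone does.

```python
import numpy as np


def kennard_stone_order(X):
    """Return sample indices in Kennard-Stone selection order (sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2:
        return list(range(n))

    # Full pairwise Euclidean distances, computed once.
    # Note: the intermediate array needs O(n^2 * n_features) memory.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Classic start: the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = np.ones(n, dtype=bool)
    remaining[[i, j]] = False

    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.minimum(D[i], D[j])

    while remaining.any():
        # Pick the remaining sample farthest from the selected set.
        candidates = np.where(remaining, min_dist, -np.inf)
        k = int(np.argmax(candidates))
        selected.append(k)
        remaining[k] = False
        # Update the running minimum in O(n).
        min_dist = np.minimum(min_dist, D[k])

    return selected


order = kennard_stone_order([[0.0], [1.0], [2.0], [10.0]])
print(order)
```

With this bookkeeping the whole ordering is O(n^2) in time after the distance matrix is built, at the cost of holding that matrix in memory.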

Make conda installation as well

First of all, thank you for writing this library. It is very helpful. But please make it available on conda as well. If you need my help to publish it in the conda-forge channel, I will send a PR.
