yu9824 / kennard_stone

This is an algorithm for evenly partitioning data.

Home Page: https://pypi.org/project/kennard-stone/

License: MIT License

Python 100.00%
scikit-learn train-test-split kfold-cross-validation python

kennard_stone's Issues

metric='nan_euclidean' does not work when input contains NaN

In version 2.10, train_test_split raises a ValueError when the input contains NaN, even with metric='nan_euclidean'.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:12

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:142, in train_test_split(test_size, train_size, *arrays, **kwargs)
    136 indexes = ks.get_indexes(X)[0]
    138 n_samples = _num_samples(X)
    139 n_train, n_test = _validate_shuffle_split(
    140     n_samples,
    141     self.test_size,
--> 142     self.train_size,
    143     default_test_size=self._default_test_size,
    144 )
    146 for _ in range(self.get_n_splits()):
    147     ind_test = indexes[:n_test]

File ~/mambaforge/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1689, in BaseShuffleSplit.split(self, X, y, groups)
   1659 """Generate indices to split data into training and test set.
   1660 
   1661 Parameters
   (...)
   1686 to an integer.
   1687 """
   1688 X, y, groups = indexable(X, y, groups)
-> 1689 for train, test in self._iter_indices(X, y, groups):
   1690     yield train, test

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:75, in _iter_indices(self, X, y, groups)
     41 def __init__(
     42     self,
     43     n_splits: int = 5,
   (...)
     47     **kwargs,
     48 ) -> None:
     49     """K-Folds cross-validator using the Kennard-Stone algorithm.
     50 
     51     Parameters
     52     ----------
     53     n_splits : int, optional
     54         Number of folds. Must be at least 2., by default 5
     55 
     56     metric : str, optional
     57         The distance metric to use. See the documentation of
     58         `sklearn.metrics.pairwise_distances` for valid values.
     59         , by default "euclidean"
     60 
     61         =============== ========================================
     62         metric          Function
     63         =============== ========================================
     64         'cityblock'     metrics.pairwise.manhattan_distances
     65         'cosine'        metrics.pairwise.cosine_distances
     66         'euclidean'     metrics.pairwise.euclidean_distances
     67         'haversine'     metrics.pairwise.haversine_distances
     68         'l1'            metrics.pairwise.manhattan_distances
     69         'l2'            metrics.pairwise.euclidean_distances
     70         'manhattan'     metrics.pairwise.manhattan_distances
     71         'nan_euclidean' metrics.pairwise.nan_euclidean_distances
     72         =============== ========================================
     73 
     74     n_jobs : int, optional
---> 75         The number of parallel jobs., by default None
     76     """
     77     super().__init__(n_splits=n_splits, shuffle=False, random_state=None)
     78     self.metric = metric

File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:167, in get_indexes(self, X)
    152 @overload
    153 def train_test_split(
    154     *arrays,
   (...)
    158     n_jobs: Optional[int] = None,
    159 ) -> list:
    160     pass
    163 def train_test_split(
    164     *arrays,
    165     test_size: Optional[Union[float, int]] = None,
    166     train_size: Optional[Union[float, int]] = None,
--> 167     metric: str = "euclidean",
    168     n_jobs: Optional[int] = None,
    169     **kwargs,
    170 ) -> list:
    171     """Split arrays or matrices into train and test subsets using the
    172     Kennard-Stone algorithm.
    173 
   (...)
    224     ValueError
    225     """
    226     if "shuffle" in kwargs:

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    915         raise ValueError(
    916             "Found array with dim %d. %s expected <= 2."
    917             % (array.ndim, estimator_name)
    918         )
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,
    924             estimator_name=estimator_name,
    925             allow_nan=force_all_finite == "allow-nan",
    926         )
    928 if ensure_min_samples > 0:
    929     n_samples = _num_samples(array)

File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    144 if estimator_name and input_name == "X" and has_nan_error:
    145     # Improve the error message on how to handle missing values in
    146     # scikit-learn.
    147     msg_err += (
    148         f"\n{estimator_name} does not accept missing values"
    149         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    159         "#estimators-that-handle-nan-values"
    160     )
--> 161 raise ValueError(msg_err)

ValueError: Input contains NaN.
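
For context, the reporter expects the 'nan_euclidean' metric to skip missing coordinates rather than reject the whole input. The block below is a minimal pure-Python sketch of that distance as defined in the scikit-learn documentation for nan_euclidean_distances: the squared differences over the coordinates present in both samples are scaled up by (total coordinates / present coordinates). The function name nan_euclidean is hypothetical and this is not the library's code.

```python
import math


def nan_euclidean(x, y):
    """Euclidean distance that ignores coordinates missing (NaN) in either sample.

    Hypothetical helper for illustration only; it follows the definition
    documented for sklearn.metrics.pairwise.nan_euclidean_distances.
    """
    # Keep only the coordinate pairs where both values are present.
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    if not pairs:
        # No shared coordinates: the distance is undefined.
        return math.nan
    # Scale up to compensate for the coordinates that were skipped.
    weight = len(x) / len(pairs)
    return math.sqrt(weight * sum((a - b) ** 2 for a, b in pairs))


nan = math.nan
d = nan_euclidean([3.0, nan, nan, 6.0], [1.0, nan, 4.0, 5.0])
```

Here only the first and last coordinates are present in both samples, so d equals sqrt(4/2 * ((3-1)^2 + (6-5)^2)). A splitter that validates its input with a strict finite check fails before a metric like this is ever evaluated, which matches the traceback above.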

Latest Release breaks compatibility with Python 3.7

The latest changes in v2 are no longer compatible with Python 3.7. Run this example in an environment with Python 3.7 installed to reproduce the error:

>>> from kennard_stone import train_test_split
>>> import numpy as np

>>> train_test_split(np.array([1,2,3]).reshape(-1, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 142, in train_test_split
    train, test = next(cv.split(X=arrays[0]))
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1600, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 75, in _iter_indices
    indexes = ks.get_indexes(X)[0]
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 194, in get_indexes
    X=X, lst_idx_selected=lst_idx_selected, idx_remaining=idx_remaining
  File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 211, in _sort
    sum(lst_idx_selected, start=[])
TypeError: sum() takes no keyword arguments

This change in the call to sum should be reverted, or Python 3.7 should be dropped from the list of supported Python versions.
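
For reference, sum() only accepts start as a keyword argument from Python 3.8 onward; on 3.7 the start value has to be passed positionally. Below is a minimal sketch of two 3.7-compatible ways to flatten a list of index lists. It assumes lst_idx_selected is a list of lists, and the sample values are hypothetical rather than taken from the library.

```python
from itertools import chain

# Hypothetical sample data: a list of index lists to be flattened.
lst_idx_selected = [[0, 3], [1], [2, 4]]

# Option 1: pass the start value positionally (works on Python 3.7 and later).
flat_positional = sum(lst_idx_selected, [])

# Option 2: avoid sum() entirely; chain.from_iterable runs in linear time,
# whereas summing lists copies the accumulated list on every step.
flat_chain = list(chain.from_iterable(lst_idx_selected))
```

Both options give the same flattened list; the second also scales better when there are many sublists.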

Very slow with large amount of data

Hello yu9824,
train_test_split is quite slow with thousands of samples.
With 500 samples the split takes only a few seconds, but with 2000 samples it takes 7 minutes.
The split also seems to use only one CPU core.
I look forward to performance optimizations.
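
To make the scaling concrete, here is a minimal pure-Python sketch of the textbook Kennard-Stone ordering. It is not the library's implementation, and the names kennard_stone_order and _euclidean are hypothetical. The pairwise distance table alone grows quadratically with the number of samples, and the sketch keeps a running minimum distance per candidate so that each selection step stays linear.

```python
import math


def _euclidean(p, q):
    # Plain Euclidean distance between two equal-length sequences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kennard_stone_order(points):
    """Return sample indices in textbook Kennard-Stone selection order."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Full pairwise distance table: O(n^2) time and memory.
    dist = [[_euclidean(p, q) for q in points] for p in points]
    # Seed with the two mutually farthest samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # Running minimum distance from each candidate to the selected set.
    min_dist = {k: min(dist[k][i0], dist[k][j0]) for k in remaining}
    while remaining:
        # Pick the candidate farthest from everything selected so far.
        nxt = max(remaining, key=lambda k: min_dist[k])
        selected.append(nxt)
        remaining.remove(nxt)
        # Update the running minima in O(n) instead of rescanning all pairs.
        for k in remaining:
            min_dist[k] = min(min_dist[k], dist[k][nxt])
    return selected


order = kennard_stone_order([(0.0,), (1.0,), (2.0,), (10.0,)])
```

Recomputing each candidate's minimum distance from scratch at every step, instead of updating it incrementally, would push the cost toward cubic. That is one possible explanation for a jump like the one reported, though the source does not confirm the cause.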

Make it installable via conda as well

First of all, thank you for writing this library. It is very helpful. But please make it available on conda as well. If you need my help to publish it in the conda-forge channel, I will send a PR.
