yu9824 / kennard_stone
This is an algorithm for evenly partitioning.
Home Page: https://pypi.org/project/kennard-stone/
License: MIT License
In version 2.10, train_test_split raises ValueError when the input contains NaN.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File <timed exec>:12
File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:142, in train_test_split(test_size, train_size, *arrays, **kwargs)
136 indexes = ks.get_indexes(X)[0]
138 n_samples = _num_samples(X)
139 n_train, n_test = _validate_shuffle_split(
140 n_samples,
141 self.test_size,
--> 142 self.train_size,
143 default_test_size=self._default_test_size,
144 )
146 for _ in range(self.get_n_splits()):
147 ind_test = indexes[:n_test]
File ~/mambaforge/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1689, in BaseShuffleSplit.split(self, X, y, groups)
1659 """Generate indices to split data into training and test set.
1660
1661 Parameters
(...)
1686 to an integer.
1687 """
1688 X, y, groups = indexable(X, y, groups)
-> 1689 for train, test in self._iter_indices(X, y, groups):
1690 yield train, test
File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:75, in _iter_indices(self, X, y, groups)
41 def __init__(
42 self,
43 n_splits: int = 5,
(...)
47 **kwargs,
48 ) -> None:
49 """K-Folds cross-validator using the Kennard-Stone algorithm.
50
51 Parameters
52 ----------
53 n_splits : int, optional
54 Number of folds. Must be at least 2., by default 5
55
56 metric : str, optional
57 The distance metric to use. See the documentation of
58 `sklearn.metrics.pairwise_distances` for valid values.
59 , by default "euclidean"
60
61 =============== ========================================
62 metric Function
63 =============== ========================================
64 'cityblock' metrics.pairwise.manhattan_distances
65 'cosine' metrics.pairwise.cosine_distances
66 'euclidean' metrics.pairwise.euclidean_distances
67 'haversine' metrics.pairwise.haversine_distances
68 'l1' metrics.pairwise.manhattan_distances
69 'l2' metrics.pairwise.euclidean_distances
70 'manhattan' metrics.pairwise.manhattan_distances
71 'nan_euclidean' metrics.pairwise.nan_euclidean_distances
72 =============== ========================================
73
74 n_jobs : int, optional
---> 75 The number of parallel jobs., by default None
76 """
77 super().__init__(n_splits=n_splits, shuffle=False, random_state=None)
78 self.metric = metric
File ~/mambaforge/lib/python3.11/site-packages/kennard_stone/kennard_stone.py:167, in get_indexes(self, X)
152 @overload
153 def train_test_split(
154 *arrays,
(...)
158 n_jobs: Optional[int] = None,
159 ) -> list:
160 pass
163 def train_test_split(
164 *arrays,
165 test_size: Optional[Union[float, int]] = None,
166 train_size: Optional[Union[float, int]] = None,
--> 167 metric: str = "euclidean",
168 n_jobs: Optional[int] = None,
169 **kwargs,
170 ) -> list:
171 """Split arrays or matrices into train and test subsets using the
172 Kennard-Stone algorithm.
173
(...)
224 ValueError
225 """
226 if "shuffle" in kwargs:
File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
915 raise ValueError(
916 "Found array with dim %d. %s expected <= 2."
917 % (array.ndim, estimator_name)
918 )
920 if force_all_finite:
--> 921 _assert_all_finite(
922 array,
923 input_name=input_name,
924 estimator_name=estimator_name,
925 allow_nan=force_all_finite == "allow-nan",
926 )
928 if ensure_min_samples > 0:
929 n_samples = _num_samples(array)
File ~/mambaforge/lib/python3.11/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
144 if estimator_name and input_name == "X" and has_nan_error:
145 # Improve the error message on how to handle missing values in
146 # scikit-learn.
147 msg_err += (
148 f"\n{estimator_name} does not accept missing values"
149 " encoded as NaN natively. For supervised learning, you might want"
(...)
159 "#estimators-that-handle-nan-values"
160 )
--> 161 raise ValueError(msg_err)
ValueError: Input contains NaN.
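A possible workaround, assuming it is acceptable to discard incomplete samples, is to filter out rows containing NaN before calling train_test_split. The sketch below uses only the standard library; `drop_nan_rows` is a hypothetical helper for illustration and is not part of kennard_stone. With NumPy arrays, the equivalent mask would be `X[~np.isnan(X).any(axis=1)]`.

```python
import math


def drop_nan_rows(rows):
    """Return (kept_rows, kept_indices) for the rows that contain no NaN.

    Hypothetical helper, not part of kennard_stone.
    """
    kept_rows, kept_indices = [], []
    for i, row in enumerate(rows):
        if not any(math.isnan(value) for value in row):
            kept_rows.append(row)
            kept_indices.append(i)
    return kept_rows, kept_indices


# Illustrative data: the second sample has a missing feature.
X = [[1.0, 2.0], [float("nan"), 0.5], [3.0, 4.0]]
X_clean, kept = drop_nan_rows(X)
print(X_clean)
print(kept)
```

Keeping the original indices makes it possible to map the resulting train/test split back onto the unfiltered data.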
The latest changes in v2 are no longer compatible with Python 3.7. Run this example in an environment with Python 3.7 installed to reproduce the error:
>>> from kennard_stone import train_test_split
>>> import numpy as np
>>> train_test_split(np.array([1,2,3]).reshape(-1, 1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 142, in train_test_split
train, test = next(cv.split(X=arrays[0]))
File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 1600, in split
for train, test in self._iter_indices(X, y, groups):
File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 75, in _iter_indices
indexes = ks.get_indexes(X)[0]
File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 194, in get_indexes
X=X, lst_idx_selected=lst_idx_selected, idx_remaining=idx_remaining
File "/home/jackson/mambaforge/envs/py37/lib/python3.7/site-packages/kennard_stone/kennard_stone.py", line 211, in _sort
sum(lst_idx_selected, start=[])
TypeError: sum() takes no keyword arguments
This change in the call to sum should be reverted, or Python 3.7 should be dropped from the list of supported Python versions.
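For context, `sum()` only accepts `start` as a keyword argument from Python 3.8 onward. A minimal sketch of two ways to flatten nested lists that also run on Python 3.7 follows; the sample data is a made-up stand-in for `lst_idx_selected`, not values taken from the library.

```python
from itertools import chain

# Hypothetical stand-in for the nested index lists named in the traceback.
lst_idx_selected = [[0, 4], [2], [1, 3]]

# Option 1: pass the start value positionally (works on Python 3.7 and later).
flat_positional = sum(lst_idx_selected, [])

# Option 2: chain the sublists; this avoids the intermediate list copies
# that repeated list addition inside sum() creates.
flat_chain = list(chain.from_iterable(lst_idx_selected))

print(flat_positional)
print(flat_chain)
```

Either form keeps the original element order, so it could replace the keyword call without changing the result.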
Hello yu9824,
train_test_split is quite slow with thousands of samples.
With 500 samples the split takes only a few seconds, but with 2000 samples it takes 7 minutes.
The split process seems to use only one CPU core.
I look forward to performance optimizations.
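To make a report like this easier to reproduce, one could time the split at several sample counts and compare how the runtime grows. The following is only a sketch: `time_split` and `quadratic_stand_in` are hypothetical names, and the stand-in function merely simulates pairwise work so the example runs without the library. In practice the real split call would be passed in instead.

```python
import time


def time_split(split_fn, sizes):
    """Return {n: seconds} for calling split_fn on n one-feature samples."""
    timings = {}
    for n in sizes:
        data = [[float(i)] for i in range(n)]
        start = time.perf_counter()
        split_fn(data)
        timings[n] = time.perf_counter() - start
    return timings


def quadratic_stand_in(data):
    # Hypothetical placeholder doing O(n^2) pairwise work;
    # replace it with the real splitter when measuring.
    return sum(abs(a[0] - b[0]) for a in data for b in data)


timings = time_split(quadratic_stand_in, [50, 100, 200])
for n, seconds in timings.items():
    print(n, round(seconds, 4))
```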
First of all, thank you for writing this library. It is very helpful. Please make it available on conda as well. If you need my help to publish it on the conda-forge channel, I will send a PR.