
Oblique Tree classifier based on SVM nodes

Home Page: https://stree.readthedocs.io

License: MIT License


STree

Oblique Tree classifier based on SVM nodes. The nodes are built and split with sklearn SVC models. STree is a sklearn estimator and can be integrated into pipelines, grid searches, etc.


Installation

pip install git+https://github.com/doctorado-ml/stree
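
A minimal usage sketch (the dataset and the train/test split are arbitrary choices for illustration):

```python
# Minimal usage sketch; any scikit-learn style dataset works, load_wine is just an example.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from stree import Stree

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = Stree(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```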

Documentation

The documentation can be found at stree.readthedocs.io

Examples

Jupyter notebooks

  • Benchmark

  • Some features

  • Gridsearch

  • Ensembles

Hyperparameters

| Hyperparameter | Type/Values | Default | Meaning |
| --- | --- | --- | --- |
| * C | <float> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
| * kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of "liblinear", "linear", "poly", "rbf" or "sigmoid". liblinear uses the liblinear library; the rest use the libsvm library through scikit-learn. |
| * max_iter | <int> | 1e5 | Hard limit on iterations within the solver, or -1 for no limit. |
| * random_state | <int> | None | Controls the pseudo-random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. |
| max_depth | <int> | None | Specifies the maximum depth of the tree. |
| * tol | <float> | 1e-4 | Tolerance for the stopping criterion. |
| * degree | <int> | 3 | Degree of the polynomial kernel function ("poly"). Ignored by all other kernels. |
| * gamma | {"scale", "auto"} or <float> | scale | Kernel coefficient for "rbf", "poly" and "sigmoid". If gamma="scale" (default) is passed, it uses 1 / (n_features * X.var()) as the value of gamma; if "auto", it uses 1 / n_features. |
| split_criteria | {"impurity", "max_samples"} | impurity | Decides (only in multiclass classification) which column (class) to use to split the dataset in a node**. max_samples is incompatible with the "ovo" multiclass_strategy. |
| criterion | {"gini", "entropy"} | entropy | The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. |
| min_samples_split | <int> | 0 | The minimum number of samples required to split an internal node. 0 (default) means any number. |
| max_features | <int>, <float> or {"auto", "sqrt", "log2"} | None | The number of features to consider when looking for the split: if int, consider max_features features at each split; if float, max_features is a fraction and int(max_features * n_features) features are considered at each split; if "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if None, max_features=n_features. |
| splitter | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies: "best": the sklearn SelectKBest algorithm is used in every node to choose the max_features best features; "random": the algorithm generates 5 candidates and chooses the best of them (max. info. gain); "trandom": the algorithm generates only one random combination; "mutual": chooses the best features w.r.t. their mutual information with the label; "cfs": applies Correlation-based Feature Selection; "fcbf": applies the Fast Correlation-Based Filter; "iwss": IWSS-based algorithm. |
| normalize | <bool> | False | Whether to standardize the features on each node with the samples that reach it. |
| * multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets: "ovo": one versus one; "ovr": one versus rest. |

* Hyperparameter used by the support vector classifier of every node
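
A small sketch of how these hyperparameters can be explored with a scikit-learn grid search; the dataset and candidate values are arbitrary illustration choices:

```python
# Hedged sketch: tuning a few of the hyperparameters above with GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from stree import Stree

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
    "max_depth": [None, 3, 5],
}
search = GridSearchCV(Stree(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```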

** Splitting in a STree node

The decision function is applied to the dataset and the distances from the samples to the hyperplanes are computed into a matrix. This matrix has as many columns as there are classes in the dataset (in the multiclass case) or a single column if it is a binary-class dataset. In binary classification only one hyperplane is computed, so only one column is needed to store the distances of the samples to it. If three or more classes are present, as many hyperplanes as classes are needed, and therefore one column per hyperplane.

In multiclass classification we have to decide which column to take into account to make the split, which depends on the hyperparameter split_criteria: if "impurity" is chosen, STree computes the information gain of every split candidate using each column and chooses the one that maximizes it; otherwise STree chooses the column with the most samples of a predicted class (the column with the most positive numbers in it).

Once we have the column to take into account for the split, the algorithm separates the samples with positive distances to the hyperplane from the rest.
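
An illustrative sketch of this column selection and split (simplified; not the library's internal code, and the entropy-based information gain below stands in for STree's impurity computation):

```python
# Simplified illustration of the multiclass split described above (not STree's internal code).
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, mask):
    n = len(y)
    if mask.sum() in (0, n):  # a split that keeps everything on one side gains nothing
        return 0.0
    return entropy(y) - mask.sum() / n * entropy(y[mask]) - (~mask).sum() / n * entropy(y[~mask])

def choose_split_column(distances, y, split_criteria="impurity"):
    """distances: (n_samples, n_classes) signed distances to each class hyperplane."""
    if split_criteria == "max_samples":
        # column with the most positive distances, i.e. the most samples predicted for that class
        return int(np.argmax((distances > 0).sum(axis=0)))
    # "impurity": column whose induced split maximizes the information gain
    gains = [information_gain(y, distances[:, col] > 0) for col in range(distances.shape[1])]
    return int(np.argmax(gains))

def split_node(X, distances, column):
    mask = distances[:, column] > 0
    return X[mask], X[~mask]  # positive side vs. the rest
```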

Tests

python -m unittest -v stree.tests

License

STree is MIT licensed

Reference

R. Montañana, J. A. Gámez, J. M. Puerta, "STree: a single multi-class oblique decision tree based on support vector machines", 2021, LNAI 12882, pp. 54-64.


stree's Issues

Implement iwss

  • Add a hyperparameter to accept this feature selection algorithm
  • Add Tests
  • Add Split management

Add CFS/FCBF to splitter for nodes

Add another couple of values to the hyperparameter splitter:

cfs to apply cfs on every node to select features

fcbf to apply fcbf on every node to select features

Are we going to limit the features returned by these algorithms?

Add parallel split check

Add the possibility of computing parallel splits, then comparing their information gain with the SVM split's and taking the best split at each node.

Normalize features on each node

When the dataset arrives at a node at training time, create and use a scaler to fit/transform the dataset feature-wise before calling fit on the SVC; afterwards, store the scaler in the node.

At prediction time, use the stored scaler to transform the dataset before evaluating the node's SVC.

This transformation is not applied once to the whole dataset before splitting; the normalization has to be done in every node with the samples that reach it.
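
A sketch of the idea (not the library's code): each node keeps its own scaler, fitted only on the samples that reach it, and reuses it at prediction time:

```python
# Sketch of per-node normalization (illustration only, not STree's implementation).
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class NodeSketch:
    def fit(self, X_node, y_node):
        # the scaler is fitted only with the samples that reach this node
        self.scaler_ = StandardScaler().fit(X_node)
        self.clf_ = SVC(kernel="linear").fit(self.scaler_.transform(X_node), y_node)
        return self

    def decision_function(self, X):
        # at prediction time, reuse the stored scaler before querying the node's SVC
        return self.clf_.decision_function(self.scaler_.transform(X))
```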

Add predict_proba

predict_proba was developed for the binary case; it can be extrapolated to the multiclass case.

Test error with scikit-learn 0.24.0

======================================================================
FAIL: test_is_a_sklearn_classifier (stree.tests.Stree_test.Stree_test)

Traceback (most recent call last):
File "/home/rof/src/github.com/Doctorado-ML/STree/stree/tests/Stree_test.py", line 135, in test_is_a_sklearn_classifier
check_estimator(Stree())
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 547, in check_estimator
check(estimator)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/_testing.py", line 308, in wrapper
return fn(*args, **kwargs)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 921, in check_sample_weights_invariance
assert_allclose_dense_sparse(X_pred1, X_pred2, err_msg=err_msg)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/_testing.py", line 415, in assert_allclose_dense_sparse
assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=1e-09
For Stree, a zero sample_weight is not equivalent to removing the sample
Mismatched elements: 8 / 16 (50%)
Max absolute difference: 1
Max relative difference: 1.
x: array([1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2])
y: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Samples in the leaf

Hi,
Thanks for your package.
How can I get the samples at each leaf node?

the optimal hyperplane

Hi, thank you so much for the package!
How to know the optimal hyperplane for each node (except leaf nodes)?
And is it possible to focus on the classification accuracy of a certain class of samples (for example, I want more 0 class samples to be accurately classified instead of getting a higher overall accuracy)?

Complete comments

Add comments to every function of STree whose behavior is not obvious

Fix Jupyter notebooks mistakes

  • Remove no_test. number 9 in features notebook
  • Neural Network to Neural Net. label in benchmark
  • Stree label in benchmark
  • Remove link to adaboost notebook in readme.md
  • Remove link to test graphics

add SelectKBest scikit function if max_features is not None

max_features can be:

  • None => all features
  • "auto" or "sqrt" => sqrt of # features
  • "log2" => log2 of # features
  • <int> => that number of features
  • <float> => int(<float> * # features)

If set, STree randomly generates as many as 5 feature combinations and, depending on the hyperparameter splitter_type:

  • random: chooses one combination randomly
  • best: chooses the best of the five (using information gain)

The improvement could be choosing the best max_features features using SelectKBest.
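
A hedged sketch of what that improvement could look like with scikit-learn's SelectKBest (mutual_info_classif is just one possible scoring function):

```python
# Possible approach (illustration only): pick the best max_features features in a node
# with SelectKBest instead of sampling random feature combinations.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def best_feature_subset(X_node, y_node, max_features):
    selector = SelectKBest(score_func=mutual_info_classif, k=max_features)
    selector.fit(X_node, y_node)
    return selector.get_support(indices=True)  # indices of the selected features
```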

Add random subspaces to nodes in STree

If set in the constructor, the nodes shall apply random subspaces in the following way:

  1. Select a given number of feature subsets, each with a preset cardinality (the number of subsets and the cardinality become hyperparameters of the model). For now every subset is taken into account.
  2. Train a classifier on the node dataset with each subset and select the one that yields the highest information gain when splitting with the selected criterion.
  3. Split the node dataset based on the selected feature subset.

Enhance partition criteria in multiclass

At training time: compute the Gini index for the partitions proposed by every class and take the best one. Keep the chosen class in the node to use it when predicting.

Add a parameter value to split_criteria, i.e. 'impurity'.
