
Oblique Tree classifier based on SVM nodes

Home Page: https://stree.readthedocs.io

License: MIT License


STree

Oblique Tree classifier based on SVM nodes. The nodes are built and split with sklearn SVC models. STree is a sklearn estimator and can be integrated into pipelines, grid searches, etc.


Installation

pip install git+https://github.com/doctorado-ml/stree
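
A minimal usage sketch (the dataset and the train/test split are arbitrary choices for illustration):

```python
# Minimal usage sketch; any scikit-learn style dataset works, load_wine is just an example.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from stree import Stree

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = Stree(random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```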

Documentation

The documentation can be found at stree.readthedocs.io

Examples

Jupyter notebooks

  • Benchmark

  • Some features

  • Gridsearch

  • Ensembles

Hyperparameters

| Hyperparameter | Type/Values | Default | Meaning |
| --- | --- | --- | --- |
| * C | <float> | 1.0 | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. |
| * kernel | {"liblinear", "linear", "poly", "rbf", "sigmoid"} | linear | Specifies the kernel type to be used in the algorithm. It must be one of "liblinear", "linear", "poly", "rbf" or "sigmoid". liblinear uses the liblinear library; the rest use the libsvm library through scikit-learn. |
| * max_iter | <int> | 1e5 | Hard limit on iterations within the solver, or -1 for no limit. |
| * random_state | <int> | None | Controls the pseudo-random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. |
| max_depth | <int> | None | Specifies the maximum depth of the tree. |
| * tol | <float> | 1e-4 | Tolerance for the stopping criterion. |
| * degree | <int> | 3 | Degree of the polynomial kernel function ("poly"). Ignored by all other kernels. |
| * gamma | {"scale", "auto"} or <float> | scale | Kernel coefficient for "rbf", "poly" and "sigmoid". If gamma="scale" (default) is passed, it uses 1 / (n_features * X.var()) as the value of gamma; if "auto", it uses 1 / n_features. |
| split_criteria | {"impurity", "max_samples"} | impurity | Decides (only in multiclass classification) which column (class) to use to split the dataset in a node**. max_samples is incompatible with the "ovo" multiclass_strategy. |
| criterion | {"gini", "entropy"} | entropy | The function to measure the quality of a split (only used if max_features != num_features). Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. |
| min_samples_split | <int> | 0 | The minimum number of samples required to split an internal node. 0 (default) means any number. |
| max_features | <int>, <float> or {"auto", "sqrt", "log2"} | None | The number of features to consider when looking for the split: if int, consider max_features features at each split; if float, max_features is a fraction and int(max_features * n_features) features are considered at each split; if "auto", max_features=sqrt(n_features); if "sqrt", max_features=sqrt(n_features); if "log2", max_features=log2(n_features); if None, max_features=n_features. |
| splitter | {"best", "random", "trandom", "mutual", "cfs", "fcbf", "iwss"} | "random" | The strategy used to choose the feature set at each node (only used if max_features < num_features). Supported strategies: "best": the sklearn SelectKBest algorithm is used in every node to choose the max_features best features; "random": the algorithm generates 5 candidates and chooses the best of them (max. info. gain); "trandom": the algorithm generates only one random combination; "mutual": chooses the best features w.r.t. their mutual information with the label; "cfs": applies Correlation-based Feature Selection; "fcbf": applies the Fast Correlation-Based Filter; "iwss": IWSS-based algorithm. |
| normalize | <bool> | False | Whether to standardize the features on each node with the samples that reach it. |
| * multiclass_strategy | {"ovo", "ovr"} | "ovo" | Strategy to use with multiclass datasets: "ovo": one versus one; "ovr": one versus rest. |

* Hyperparameter used by the support vector classifier of every node
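
A small sketch of how these hyperparameters can be explored with a scikit-learn grid search; the dataset and candidate values are arbitrary illustration choices:

```python
# Hedged sketch: tuning a few of the hyperparameters above with GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from stree import Stree

X, y = load_iris(return_X_y=True)
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
    "max_depth": [None, 3, 5],
}
search = GridSearchCV(Stree(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```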

** Splitting in a STree node

The decision function is applied to the dataset and the distances from the samples to the hyperplanes are computed into a matrix. This matrix has as many columns as there are classes in the dataset (in the multiclass case) or a single column if it is a binary-class dataset. In binary classification only one hyperplane is computed, so only one column is needed to store the distances of the samples to it. If three or more classes are present, as many hyperplanes as classes are needed, and therefore one column per hyperplane.

In multiclass classification we have to decide which column to take into account to make the split, which depends on the hyperparameter split_criteria: if "impurity" is chosen, STree computes the information gain of every split candidate using each column and chooses the one that maximizes it; otherwise STree chooses the column with the most samples of a predicted class (the column with the most positive numbers in it).

Once we have the column to take into account for the split, the algorithm separates the samples with positive distances to the hyperplane from the rest.
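
An illustrative sketch of this column selection and split (simplified; not the library's internal code, and the entropy-based information gain below stands in for STree's impurity computation):

```python
# Simplified illustration of the multiclass split described above (not STree's internal code).
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y, mask):
    n = len(y)
    if mask.sum() in (0, n):  # a split that keeps everything on one side gains nothing
        return 0.0
    return entropy(y) - mask.sum() / n * entropy(y[mask]) - (~mask).sum() / n * entropy(y[~mask])

def choose_split_column(distances, y, split_criteria="impurity"):
    """distances: (n_samples, n_classes) signed distances to each class hyperplane."""
    if split_criteria == "max_samples":
        # column with the most positive distances, i.e. the most samples predicted for that class
        return int(np.argmax((distances > 0).sum(axis=0)))
    # "impurity": column whose induced split maximizes the information gain
    gains = [information_gain(y, distances[:, col] > 0) for col in range(distances.shape[1])]
    return int(np.argmax(gains))

def split_node(X, distances, column):
    mask = distances[:, column] > 0
    return X[mask], X[~mask]  # positive side vs. the rest
```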

Tests

python -m unittest -v stree.tests

License

STree is MIT licensed

Reference

R. Montañana, J. A. Gámez, J. M. Puerta, "STree: a single multi-class oblique decision tree based on support vector machines", 2021, LNAI 12882, pp. 54-64.


stree's Issues

Implement iwss

  • Add a hyperparameter to accept this feature selection algorithm
  • Add Tests
  • Add Split management

Add CFS/FCBF to splitter for nodes

Add another couple of values to the hyperparameter splitter:

cfs to apply cfs on every node to select features

fcbf to apply fcbf on every node to select features

Are we going to limit the features returned by these algorithms?

Add parallel split check

Add the possibility of computing parallel splits, then comparing their information gain with the SVM split's and taking the best split at each node.

Normalize features on each node

When the dataset arrives at a node at training time, create and use a scaler to fit/transform the dataset feature-wise before calling fit on the SVC; afterwards, store the scaler in the node.

At prediction time, use the stored scaler to transform the dataset before evaluating the node's SVC.

This transformation is not applied once to the whole dataset before splitting; the normalization has to be done in every node with the samples that reach it.
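
A sketch of the idea (not the library's code): each node keeps its own scaler, fitted only on the samples that reach it, and reuses it at prediction time:

```python
# Sketch of per-node normalization (illustration only, not STree's implementation).
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class NodeSketch:
    def fit(self, X_node, y_node):
        # the scaler is fitted only with the samples that reach this node
        self.scaler_ = StandardScaler().fit(X_node)
        self.clf_ = SVC(kernel="linear").fit(self.scaler_.transform(X_node), y_node)
        return self

    def decision_function(self, X):
        # at prediction time, reuse the stored scaler before querying the node's SVC
        return self.clf_.decision_function(self.scaler_.transform(X))
```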

Add predict_proba

predict_proba was developed for the binary case; it can be extrapolated to the multiclass case.

Test error with scikit-learn 0.24.0

======================================================================
FAIL: test_is_a_sklearn_classifier (stree.tests.Stree_test.Stree_test)

Traceback (most recent call last):
File "/home/rof/src/github.com/Doctorado-ML/STree/stree/tests/Stree_test.py", line 135, in test_is_a_sklearn_classifier
check_estimator(Stree())
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 547, in check_estimator
check(estimator)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/_testing.py", line 308, in wrapper
return fn(*args, **kwargs)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 921, in check_sample_weights_invariance
assert_allclose_dense_sparse(X_pred1, X_pred2, err_msg=err_msg)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/sklearn/utils/_testing.py", line 415, in assert_allclose_dense_sparse
assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/home/rof/.pyenv/versions/3.8.6/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=1e-09
For Stree, a zero sample_weight is not equivalent to removing the sample
Mismatched elements: 8 / 16 (50%)
Max absolute difference: 1
Max relative difference: 1.
x: array([1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2])
y: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Samples in the leaf

Hi,
Thanks for your package.
How can I get the samples at each leaf node?

the optimal hyperplane

Hi, thank you so much for the package!
How to know the optimal hyperplane for each node (except leaf nodes)?
And is it possible to focus on the classification accuracy of a certain class of samples (for example, I want more 0 class samples to be accurately classified instead of getting a higher overall accuracy)?

Complete comments

Add comments to every function of STree whose behavior is not obvious

Fix Jupyter notebooks mistakes

  • Remove no_test. number 9 in features notebook
  • Neural Network to Neural Net. label in benchmark
  • Stree label in benchmark
  • Remove link to adaboost notebook in readme.md
  • Remove link to test graphics

add SelectKBest scikit function if max_features is not None

max_features can be:

  • None => all features
  • "auto" or "sqrt" => sqrt of # features
  • "log2" => log2 of # features
  • <int> => that number of features
  • <float> => int(<float> * # features)

If set, STree randomly generates as many as 5 feature combinations and, depending on the hyperparameter splitter_type:

  • random: chooses one combination randomly
  • best: chooses the best of the five (using information gain)

The improvement could be choosing the best max_features features using SelectKBest.
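
A hedged sketch of what that improvement could look like with scikit-learn's SelectKBest (mutual_info_classif is just one possible scoring function):

```python
# Possible approach (illustration only): pick the best max_features features in a node
# with SelectKBest instead of sampling random feature combinations.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def best_feature_subset(X_node, y_node, max_features):
    selector = SelectKBest(score_func=mutual_info_classif, k=max_features)
    selector.fit(X_node, y_node)
    return selector.get_support(indices=True)  # indices of the selected features
```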

Add random subspaces to nodes in STree

If set in the constructor, the nodes shall apply random subspaces in the following way:

  1. Select a given number of feature subsets, each with a preset cardinality (the number of subsets and the cardinality become hyperparameters of the model). For now every subset is taken into account.
  2. Train a classifier on the node dataset with each subset and select the one that yields the highest information gain when splitting with the selected criterion.
  3. Split the node dataset based on the selected feature subset.

Enhance partition criteria in multiclass

At training time: compute the Gini index for the partitions proposed by every class and take the best one. Keep the chosen class in the node to use it when predicting.

Add a parameter value to split_criteria, i.e. 'impurity'.
