nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

machine-learning data-science scikit-learn sklearn

sklearn-porter's Issues

(SVC export) N_vector size of the exported C is not the same with the size of training sample

I found that the N_VECTORS size in the exported C code is not the same as the size of the training sample.

Method:
I use the sample code at https://github.com/nok/sklearn-porter/blob/stable/examples/estimator/classifier/SVC/c/basics.pct.ipynb
to export the C code.
I split the data into training and test sets at a 90%:10% ratio with the following code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
irisdata = load_iris()
X=irisdata.data
y=irisdata.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=True)
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)

Output:
(135, 4) (135,)
(15, 4) (15,)

Then I train the model:

from sklearn import svm

clf = svm.SVC(C=1.0, gamma = 0.001, kernel = 'rbf', random_state = 0)
clf.fit(X_train,y_train)

Finally, I export the code:

from sklearn_porter import Porter

porter = Porter(clf, language = 'c')
output = porter.export()
print(output)

But I got:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#define N_FEATURES 4
#define N_CLASSES 3
#define N_VECTORS 132
#define N_ROWS 3
#define N_COEFFICIENTS 2
#define N_INTERCEPTS 3
#define KERNEL_TYPE 'r'
#define KERNEL_GAMMA 0.001
#define KERNEL_COEF 0.0
#define KERNEL_DEGREE 3

double vectors[132][4] = {{4.4, 3.2, 1.3, 0.2}, {5.4, 3.4, 1.5, 0.4}, {5.0, 3.2, 1.2, 0.2}, {5.0, 3.5, 1.3, 0.3}, {5.5, 4.2, 1.4, 0.2}, {5.1, 3.8, 1.5, 0.3}, {5.3, 3.7, 1.5, 0.2}, {5.2, 3.4, 1.4, 0.2}, {5.1, 3.5, 1.4, 0.3}, {5.7, 3.8, 1.7, 0.3}, {5.0, 3.6, 1.4, 0.2}, {4.8, 3.0, 1.4, 0.3}, {5.1, 3.4, 1.5, 0.2}, {5.5, 3.5, 1.3, 0.2}, {4.8, 3.4, 1.6, 0.2}, {4.8, 3.0, 1.4, 0.1}, {4.7, 3.2, 1.3, 0.2}, {4.6, 3.4, 1.4, 0.3}, {5.1, 3.8, 1.6, 0.2}, {5.4, 3.7, 1.5, 0.2}, {4.9, 3.1, 1.5, 0.2}, {5.2, 4.1, 1.5, 0.1}, {4.4, 3.0, 1.3, 0.2}, {5.2, 3.5, 1.5, 0.2}, {5.1, 3.3, 1.7, 0.5}, {4.9, 3.1, 1.5, 0.1}, {5.7, 4.4, 1.5, 0.4}, {4.5, 2.3, 1.3, 0.3}, {5.0, 3.4, 1.6, 0.4}, {5.0, 3.5, 1.6, 0.6}, ...
......

N_VECTORS is 132 instead of 135.

I tried other split ratios and the following are some examples:

Test ratio (test_size)	Training size	Exported N_VECTORS
0.5	75	75
0.4	90	89
0.3	105	100
0.2	120	113
0.1	135	132
0.05	142	141
0	150	150
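A likely explanation, assuming the exporter writes the fitted model's support vectors rather than the whole training set: an SVC only keeps the training samples that end up as support vectors, so N_VECTORS would track clf.support_vectors_.shape[0] and not the training size. The sketch below re-creates the 90%:10% split (random_state=1 is an assumption standing in for random_state=True) and compares the two counts.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=1)

clf = svm.SVC(C=1.0, gamma=0.001, kernel='rbf', random_state=0)
clf.fit(X_train, y_train)

n_train = X_train.shape[0]                 # number of training samples
n_vectors = clf.support_vectors_.shape[0]  # support vectors kept by the model
per_class = clf.n_support_                 # support vectors per class

print(n_train, n_vectors, per_class)
```

If the counts differ, the samples that are missing from the export are simply not support vectors of the fitted model.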

Support of the OneVsRestClassifier

Original issue found by @Phyks in #18 (comment):

ValueError: Currently the given model 'OneVsRestClassifier(estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1)' isn't supported.

This issue requires refactoring (maybe) all templates. The goal is to use object-oriented instances the right way instead of static methods. A solution is also needed for using templates with non-object-oriented programming languages.

Error when trying to convert RandomForestClassifier to javascript

Hi,

I am trying to run the command:
python -m sklearn_porter -i estimator.pkl --js
as instructed in the GitHub README, with a scikit-learn random forest classifier that I saved to estimator.pkl as instructed. I am using Python 3.6 from Anaconda on Ubuntu 16.04 LTS.
But it fails with the following error:
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/anaconda3/lib/python3.6/site-packages/sklearn_porter/__main__.py", line 153, in <module>
    main()
  File "/home/user/anaconda3/lib/python3.6/site-packages/sklearn_porter/__main__.py", line 105, in main
    estimator = joblib.load(input_path)
  File "/home/user/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/user/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/home/user/anaconda3/lib/python3.6/pickle.py", line 1050, in load
    dispatch[key[0]](self)
KeyError: 239

ETA for RandomForestRegressor for JS?

Just wondering if there is an ETA for a JS implementation for RandomForestRegressor?

Thanks!

MultiOutputClassifier not supported

ValueError: Currently the given estimator 'MultiOutputClassifier(estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=1, learning_rate='constant',
learning_rate_init=0.001, max_iter=1, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False),
n_jobs=1)' isn't supported.

Decision tree C code exported by porter has wrong datatype for features array it should be float

The C code exported by porter uses double for the feature values. This is the wrong data type and can affect the accuracy, because scikit-learn converts the input to float32 internally.

scikit-learn code

def predict(self, X, check_input=True):
        """Predict class or regression value for X.
        For a classification model, the predicted class for each sample in X is
        returned. For a regression model, the predicted value based on X is
        returned.
        Parameters
        ----------
        X : array-like or sparse matrix of shape = [n_samples, n_features]
            The input samples. Internally, it will be converted to
            ``dtype=np.float32`` and if a sparse matrix is provided
            to a sparse ``csr_matrix``.
        check_input : boolean, (default=True)
            Allow to bypass several input checking.
            Don't use this parameter unless you know what you do.
        Returns
        -------
        y : array of shape = [n_samples] or [n_samples, n_outputs]
            The predicted classes, or the predict values.
        """

porter C Code:

int main(int argc, const char * argv[]) {{
    /* Features: */
    double features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {{
        features[i-1] = atof(argv[i]);
    }}

    /* Prediction: */
    printf("%d", {method_name}(features, 0));
    return 0;

}}
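To illustrate why the data type can matter: the docstring quoted above says the input is converted to float32, so a feature value that sits very close to a split threshold can land on a different side depending on whether it is compared as a double or after rounding to float32. The threshold and feature values below are hypothetical and chosen only to show the effect.

```python
import struct

def to_float32(value):
    """Round a Python float (a C double) to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', value))[0]

threshold = 0.1000000015  # hypothetical split threshold, kept as a double
feature = 0.10000000151   # hypothetical input, slightly above the threshold

# Comparison on the raw double, as in the exported C code.
goes_left_double = feature <= threshold
# Comparison after the float32 cast described in the docstring.
goes_left_float32 = to_float32(feature) <= threshold

print(goes_left_double, goes_left_float32)
```

Such cases are rare, but on large test sets a few diverging branches are enough to change the reported accuracy.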

How to read the multi-layer perceptrons model in Java written using python

I am using a wrapper around the scikit-learn multilayer perceptron in Python, https://github.com/aigamedev/scikit-neuralnetwork, to train a neural network and save it to a file. Now I want to expose it in production to predict in real time, and I was thinking of using Java for better concurrency than Python. My question is whether this library can read a model written by Python or by the wrapper above. Below is the code I use to train the model; the last three lines are what I want to port to Java for production.

import pickle
import numpy as np
import pandas as pd
from sknn.mlp import Classifier, Layer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

f = open("TrainLSDataset.csv")
data = np.loadtxt(f,delimiter = ',')

x = data[:, 1:]
y = data[:, 0]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

nn = Classifier(
    layers=[            	    
        Layer("Rectifier", units=5),
        Layer("Softmax")],
    learning_rate=0.001,
    n_iter=100)

nn.fit(X_train, y_train)
filename = 'finalized_model.txt'
pickle.dump(nn, open(filename, 'wb')) 

**This is the code I want to write in Java/Go to expose the model in production:**

loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
y_pred = loaded_model.predict(X_test)

Golang SVC

I am wondering if it’s possible to add Golang support for SVM? Seems like Golang support is still limited.

RFC prediction are inconsistent when using `max_depth`

I have created a RandomForestClassifier in Python using sklearn. Now I convert the code to C using sklearn-porter. In around 10-20% of the cases the prediction of the transpiled code is wrong.

I figured that the problem occurs when specifying max_depth.

Here's some code to reproduce the issue:

import numpy as np
import sklearn_porter
from sklearn.ensemble import RandomForestClassifier

train_x = np.random.rand(1000, 8)
train_y = np.random.randint(0, 4, 1000)

# with the default max_depth (None), the problem does not occur
rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 1.0

# now with max_depth=10 the integrity score drops
rfc = RandomForestClassifier(n_estimators=10, max_depth=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 0.829

I also saw that Python is performing calculations with double while the C code seems to use float, might that be an issue? (changing float -> double did not change anything unfortunately).
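One possible explanation, not confirmed in this thread: scikit-learn's RandomForestClassifier.predict averages the class probabilities of all trees, whereas a transpiled forest may let each tree cast a single vote for its top class. With fully grown trees the leaves are typically pure and both schemes agree; with max_depth set, leaves hold mixed classes and the schemes can disagree. The per-tree probabilities below are made up for illustration.

```python
# Hypothetical class probabilities from three trees for one sample (2 classes).
tree_probas = [
    [0.6, 0.4],
    [0.6, 0.4],
    [0.1, 0.9],
]
n_classes = 2

def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hard voting: every tree votes for its most probable class.
votes = [0] * n_classes
for proba in tree_probas:
    votes[argmax(proba)] += 1
hard_prediction = argmax(votes)

# Soft voting: average the probabilities, then pick the best class.
mean_proba = [sum(p[c] for p in tree_probas) / len(tree_probas)
              for c in range(n_classes)]
soft_prediction = argmax(mean_proba)

print(hard_prediction, soft_prediction)
```

Here two of three trees favour class 0, but the averaged probabilities favour class 1, so the two schemes return different labels.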

Fails with big dataset

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, train_size=0.0001)
clf = DecisionTreeClassifier()
clf.fit(train_X, train_y)

Export:

porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)

This fails with bigger train sizes:

python3.7/site-packages/sklearn_porter/estimator/classifier/DecisionTreeClassifier/__init__.py", line 308, in create_branches
out += temp.format(features[node], '<=', self.repr(threshold[node]))
IndexError: list index out of range

Multilabel prediction

I am trying to use sklearn-porter to transform my multilabel random forest classifier into JavaScript. But the transformed classifier doesn't predict multiple labels.

Does the Sklearn Porter support multilabel prediction? If yes, could you please provide a small example of the implementation?

Are there any plans to support GradientBoostingClassifier and CalibratedClassifier?

Invalid Java generated for random forest

There are several compile errors when transpiling random forests to Java:

At the start of each predict_x method:
int[] classes = new int[[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]];
should be
int[] classes = new int[] { 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 };

At the end of each predict_x method:
for (int i = 1; i < [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]; i++)
should be
for (int i = 1; i < classes.length; i++)

At the start of each predict method:
int n_classes = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]; int[] classes = new int[n_classes];
should be
int[] classes = new int[] { 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }; int n_classes = classes.length;

Maybe there are other errors too, because the transpiled random forest does not produce the same result as Python.

Errors when porting LinearSVC model

Sorry to bother you again, but when attempting to run:
python3 -m sklearn_porter -i model_notokenizer.pkl -l java
I get:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/sklearn_porter/__main__.py", line 71, in <module>
    main()
  File "/usr/local/lib/python3.5/dist-packages/sklearn_porter/__main__.py", line 49, in main
    porter = Porter(model, language=language)
  File "/usr/local/lib/python3.5/dist-packages/sklearn_porter/Porter.py", line 65, in __init__
    raise ValueError(error)
ValueError: The given model 'Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=0.001,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])' isn't supported.

I'm running python 3.5.2, numpy 1.13.1, and sklearn 0.19.0.

Export Matrix as Vector (SVM and maybe other Models)

Firstly, I would like to thank the authors of the library; it is really useful.

Most Java algebra libraries (and probably those of other languages too) are based on 1D primitive arrays instead of 2D ones, since it is easy to map one to the other and the algorithms are simpler to write in 1D. One option is to create a new 1D array and copy the data from the 2D array, but that is not a desirable approach. I therefore suggest providing a way to save the data as a 1D primitive array (more specifically, a 1D column array). I started doing this in a copy of the repository, but I guess you can do it in a future release.
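A minimal sketch of the mapping, reading "1D column array" as column-major storage (that reading is an assumption). The helper names are hypothetical and not part of sklearn-porter.

```python
def flatten_column_major(matrix):
    """Flatten a list of rows into one list, column by column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]

def at(flat, n_rows, row, col):
    """Read element (row, col) from a column-major flat list."""
    return flat[col * n_rows + row]

vectors = [[4.4, 3.2, 1.3, 0.2],
           [5.4, 3.4, 1.5, 0.4]]
flat = flatten_column_major(vectors)

print(flat)
print(at(flat, len(vectors), 1, 2))
```

An exporter could write the flat list directly, so the consuming library never has to copy a 2D array.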

I have an observation about the SVC template (I guess it belongs in a separate issue). When you save a model that has two classes, I guess the starts and ends arrays are redundant, because coefficients is an ordered array (in the sense that all coefficients of class zero come before any coefficient of class one). It means you could change:

...
if (this.clf.nClasses == 2) {
    for (int i = 0; i < kernels.length; i++) {
        kernels[i] = -kernels[i];
    }
    double decision = 0.;
    for (int k = starts[1]; k < ends[1]; k++) {
        decision += kernels[k] * this.clf.coefficients[0][k];
    }
    for (int k = starts[0]; k < ends[0]; k++) {
        decision += kernels[k] * this.clf.coefficients[0][k];
    }            
    decision += this.clf.intercepts[0];            
    if (decision > 0) {
        return 0;
    }
    return 1;
}
...

to:

...
if (this.clf.nClasses == 2) {
    for (int i = 0; i < kernels.length; i++) {
        kernels[i] = -kernels[i];
    }
    double decision = 0.;
    for (int k = 0; k < this.clf.coefficients[0].length; k++) {
        decision += kernels[k] * this.clf.coefficients[0][k];
    }
    decision += this.clf.intercepts[0];
    if (decision > 0) {
        return 0;
    }
    return 1;
}
...

I guess you could improve the case of more than two classes too, by merging the structures decisions, votes and amounts.

Best Regards,

Charles
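A small check of the two-class suggestion above, assuming starts and ends are cumulative offsets derived from the per-class support vector counts (the n_support values and all numbers here are hypothetical). When the two ranges are contiguous and cover every index, one loop over all coefficients gives the same decision value up to floating-point rounding. The kernel negation is left out because both forms share it.

```python
import math

n_support = [3, 2]  # hypothetical support vector counts for class 0 and class 1
starts = [0, n_support[0]]
ends = [n_support[0], n_support[0] + n_support[1]]

kernels = [0.9, 0.5, 0.1, 0.7, 0.3]           # made-up kernel values
coefficients = [1.0, 0.5, 0.25, -1.0, -0.75]  # made-up dual coefficients
intercept = 0.1

# Original form: class 1 range first, then class 0 range.
decision_two_loops = 0.0
for k in range(starts[1], ends[1]):
    decision_two_loops += kernels[k] * coefficients[k]
for k in range(starts[0], ends[0]):
    decision_two_loops += kernels[k] * coefficients[k]
decision_two_loops += intercept

# Proposed form: one loop over all coefficients.
decision_one_loop = intercept + sum(
    k * c for k, c in zip(kernels, coefficients))

print(math.isclose(decision_two_loops, decision_one_loop))
```

The summation order differs between the two forms, so the comparison uses math.isclose rather than exact equality.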

Installing using anaconda

Hello,
I'm using anaconda3 and I want to install sklearn-porter.
What would be the best way to do this?
conda install doesn't work:

okoob$ conda install sklearn-porter
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - sklearn-porter

Current channels:

  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/pro/osx-64
  - https://repo.anaconda.com/pkgs/pro/noarch

install version 0.7.0+ ERROR

After running "pip install sklearn-porter" on Windows 7 with Python 3.5, version 0.7.0+ is installed by default. An error is reported:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in positio
egal multibyte sequence

ERROR: Command "python setup.py egg_info" failed with error code 1 in
aohf\AppData\Local\Temp\pip-install-kzqmrh2h\sklearn-porter\

Problems installing

I am unable to install sklearn-porter for python3 through pip.

Collecting sklearn-porter
  Using cached sklearn-porter-0.5.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-bdmsnkqx/sklearn-porter/setup.py", line 18, in <module>
        with open(requirements_path) as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-build-bdmsnkqx/sklearn-porter/requirements.txt'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-bdmsnkqx/sklearn-porter/

I am running python 3.6 on Arch Linux.

SVC predict_proba not supported

When I try to export from an SVC model with "porter = Porter(model, language='java', method='predict_proba')", it returns "Currently the chosen model method 'predict_proba' isn't supported." Is there any way to get the probability of the predicted class?

Java Error: "Too many constants"

Attempted to port a somewhat large random forest classifier (7.2 MB) for Java and compiling the Java class ended up giving a "too many constants" error, because of the number of hardcoded values to compose the tree. I circumvented this by using a simple script to separate out all (static) methods into individual classes and files. Is there a cleaner way internally to get around this problem or achieve this effect?

Feature Request: Multi-label DecisionTreeClassifier

I have trained a multi-label DecisionTreeClassifier, and when I ported it the result was the following:

public static int predict(double[] features) {
        int[] classes = new int[2];
            
        if (features[11] <= 12.5) {
            if (features[10] <= 182.5) {
                if (features[12] <= 72.5) {
                    if (features[13] <= 63.0) {
                        if (features[7] <= 767.5) {
                            classes[0] = 20; 
                            classes[1] = 5;
                            // Here the result should be:
                            // classes[0][0] = 20; classes[0][1] = 5; classes[1][0] = 25; classes[1][1] = 0; and so on...
                        } else {
                            // Huge amount of ifs
                        }
                    }
                }
            }
        }

The full decision tree is here:

I would really appreciate this feature in your porter. =]

Additionally, if this feature is already present in the code, I haven't figured out how to use it.

Thank you.

Port CountVectorizer

For text mining it is important to also fit a CountVectorizer (or a TfidfTransformer), so it should be possible to export it to the target language.

Feature Request: Multinomial Naive Bayes

Could you please add support for Multinomial Naive Bayes? Its performance on text classification makes it a very desirable target for porting.

C# support

Darius, are you fine with the idea to support C#? If so, I will go ahead whenever I have free time. I might also contribute on other parts when I'm done.
Thanks for letting me know, best, Balint

Trouble installing

Hi -- tried installing and running the Go example, but I'm getting this error:

$ cd sklearn-porter/examples/classifier/LinearSVC/go
$ python basics.py

Traceback (most recent call last):
  File "basics.py", line 12, in <module>
    model = Porter(language='go').port(clf)
  File "/Users/bjohnson/anaconda/lib/python2.7/site-packages/sklearn_porter-0.2.0-py2.7.egg/sklearn_porter/__init__.py", line 56, in port
    ported_model = instance.port(model)
  File "/Users/bjohnson/anaconda/lib/python2.7/site-packages/sklearn_porter-0.2.0-py2.7.egg/sklearn_porter/classifier/LinearSVC/__init__.py", line 69, in port
    return self.predict()
  File "/Users/bjohnson/anaconda/lib/python2.7/site-packages/sklearn_porter-0.2.0-py2.7.egg/sklearn_porter/classifier/LinearSVC/__init__.py", line 81, in predict
    return self.create_class(self.create_method())
  File "/Users/bjohnson/anaconda/lib/python2.7/site-packages/sklearn_porter-0.2.0-py2.7.egg/sklearn_porter/classifier/LinearSVC/__init__.py", line 111, in create_method
    return self.temp('method', indentation=1, skipping=True).format(
  File "/Users/bjohnson/anaconda/lib/python2.7/site-packages/sklearn_porter-0.2.0-py2.7.egg/sklearn_porter/classifier/__init__.py", line 114, in temp
    raise AttributeError('Template \'%s\' not found.' % (name))
AttributeError: Template 'method' not found.

Any thoughts? Thanks!

Is there any plans to support "distance" weights and "predict_proba" method for KNearestNeighbour ?

Currently I am generating my Java file after training with "uniform" weights, but I want to generate Java code with "distance" weights.

When I try to generate it with the "distance" weights, it raises NotImplementedError.

If I have to implement it on my own, how can I do that?

Any ideas would help. :)

A bug

Hey guys, there's a little bug: "ValueError: ("The classifier doesn't support the given base estimator %s.", None)". It seems that the default base estimator for AdaBoost is a decision tree, but the check here is "if not isinstance(estimator.base_estimator, DecisionTreeClassifier):". So please use clf.base_estimator_ (attributes ending with _ are set during fitting, not given by the user).

Can't test accuracy in python and exported java code gives bad accuracy!

Hi, I started using your code to port a random forest estimator. First off, I can't call the porter.integrity_score() function because I get the following error:

Traceback (most recent call last):
  File "C:/Python Project/Euler.py", line 63, in <module>
    accuracy = porter.integrity_score(test_X)
  File "C:\Python\lib\site-packages\sklearn_porter\Porter.py", line 440, in integrity_score
    keep_tmp_dir=True, num_format=num_format)
  File "C:\Python\lib\site-packages\sklearn_porter\Porter.py", line 342, in predict
    self._test_dependencies()
  File "C:\Python\lib\site-packages\sklearn_porter\Porter.py", line 454, in _test_dependencies
    raise EnvironmentError(error)
OSError: The required dependencies aren't available on Windows.

So I can't check the accuracy in python, and when I used the java code in eclipse it gives me very bad accuracy, the original scikit model gave me about 69% accuracy whereas the accuracy from the java code is less than 10%.

I need the code for an important project, would really appreciate some help on this.

java.lang.ArrayIndexOutOfBoundsException in Java

I used the code tag, but I don't know why it doesn't show line endings properly.

There are two bugs of the same type in the generated Java code.

double[] decisions = new double[13];
for (int i = 0, d = 0, l = 13; i < l; i++) {
    for (int j = i + 1; j < l; j++) {
        double tmp1 = 0., tmp2 = 0.;
        for (int k = starts[j]; k < ends[j]; k++) {
           tmp1 += kernels[k] * coeffs[i][k];
        }
        for (int k = starts[i]; k < ends[i]; k++) {
            tmp2 += kernels[k] * coeffs[j - 1][k];
        }
        System.out.println("d=" + d);
        decisions[d] = tmp1 + tmp2 + inters[d++];
    }
}

In my understanding, the second loop should always run with d and l initialized to 0 and 13 respectively, but it actually doesn't. This code raises an ArrayIndexOutOfBoundsException.

I suggest adding the line d = 0; right after the first for statement (l = 13; is not needed because it is a constant here).

The same bug with this code:

int[] votes = new int[13];
for (int i = 0, d = 0, l = 13; i < l; i++) {
    for (int j = i + 1; j < l; j++) {
        votes[d] = decisions[d++] > 0 ? i : j;
    }
}
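Another reading of the exception, offered as an assumption rather than a confirmed diagnosis: d is declared once in the outer for header and is incremented for every (i, j) pair with i < j. For 13 classes that is 13 * 12 / 2 = 78 pairs, which overflows arrays of length 13 unless decisions and votes are sized to the number of pairs. The sketch replays the loop counters in Python.

```python
from itertools import combinations

n_classes = 13

# Replay the nested loops and record the largest index that d reaches.
d = 0
max_index = -1
for i in range(n_classes):
    for j in range(i + 1, n_classes):
        max_index = d
        d += 1

# Number of one-vs-one class pairs, computed independently.
n_pairs = len(list(combinations(range(n_classes), 2)))

print(d, max_index, n_pairs)
```

Under this reading, resetting d inside the outer loop would avoid the exception but overwrite earlier pair results, so the array length is the part to check.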

When I run this, I got ValueError: level must be >= 0

I installed this by python setup.py install

I used python 3

$ python -m sklearn_porter --input cl.pkl --language java

Traceback (most recent call last):
  File "/home/rem/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/rem/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rem/anaconda3/lib/python3.5/site-packages/sklearn_porter-0.2.0-py3.5.egg/sklearn_porter/__main__.py", line 54, in <module>
    main()
  File "/home/rem/anaconda3/lib/python3.5/site-packages/sklearn_porter-0.2.0-py3.5.egg/sklearn_porter/__main__.py", line 35, in main
    result = porter.port(raw_model)
  File "/home/rem/anaconda3/lib/python3.5/site-packages/sklearn_porter-0.2.0-py3.5.egg/sklearn_porter/__init__.py", line 50, in port
    locals(), [md_name], -1)
ValueError: level must be >= 0
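The last frame passes -1 as the level argument of __import__. That value was valid in Python 2 (try a relative import, then an absolute one), but Python 3 requires level >= 0. Below is a minimal sketch of the failure and of one possible replacement, importlib.import_module; the json module stands in for the real module name, which is an assumption here.

```python
import importlib

error_message = None
try:
    # Python 2 style call: level=-1 is rejected by Python 3.
    __import__('json', globals(), locals(), ['dumps'], -1)
except ValueError as exc:
    error_message = str(exc)

# A version-independent alternative for absolute imports.
module = importlib.import_module('json')

print(error_message, module.__name__)
```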

Export Probability of the predictions for decision trees involved model

Firstly, thank you for this great project.

For models built on decision trees, such as a decision tree or a random forest, the probability of a prediction is often as crucial as the prediction itself, as it carries more information than just the result.

As far as the implementation goes, since the porter needs to build every leaf node anyway, I think it's possible to export the probability at each leaf node and then aggregate.

So is there any way to do that?
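A rough sketch of that idea, under the assumption that each leaf keeps its per-class sample counts: normalising the counts gives the leaf's class probabilities, and a forest can average them over its trees. The counts are invented for illustration and the function names are hypothetical.

```python
def leaf_probabilities(class_counts):
    """Turn the per-class sample counts of a leaf into probabilities."""
    total = sum(class_counts)
    return [count / total for count in class_counts]

def forest_probabilities(leaf_counts_per_tree):
    """Average the probabilities of the leaf reached in each tree."""
    probas = [leaf_probabilities(counts) for counts in leaf_counts_per_tree]
    n_trees = len(probas)
    n_classes = len(probas[0])
    return [sum(p[c] for p in probas) / n_trees for c in range(n_classes)]

single_tree = leaf_probabilities([20, 5])
forest = forest_probabilities([[20, 5], [10, 10]])

print(single_tree, forest)
```

In generated code, each leaf would return its probability vector instead of a single class, and the aggregation step would run once per prediction.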

Any plans to support XGBoost?

C code generator with multi output RFC - illegal code generated and general failure to handle multi dimension output

I'm creating a Random Forest Classifier that has 248 inputs and 108 outputs. Based on the Boolean state of each input, the 108 outputs will be on or off (they represent valves). The values of these discrete output states are what the system has learned. There are two issues I'm having with this:

The code generator only seems to create trees for one output, and I don't know which one. For each output I'd expect a separate set of trees, because the inputs remain the same, but the decision tree for each valve's state will be different.
The code for the single output generates invalid C. See below for example code fragment.

`int predict_0(float features[]) {
int classes[[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]];

if (features[181] <= 0.5) { ... }
}`

Error when exporting SVC fitted using sparse matrix

I have an svm/svc classifier trained using sparse matrix as follows:

from sklearn_porter import Porter
from sklearn import svm

# load data and train the classifier:
clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1/X_train_transformed.shape[1], kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
clf.fit(X_train_transformed, X_train['label'])
type(X_train_transformed)
----------------------------------------

scipy.sparse.csr.csr_matrix

The problem is that exporting fails with the errors shown below:

# export:
porter = Porter(clf, language='java')
output = porter.export(embed_data=False, details=False)
with open('SVC.java', 'w') as f:
    f.writelines(output)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-e7d647ff66cd> in <module>()
      1 # export:
      2 porter = Porter(clf, language='java')
----> 3 output = porter.export(embed_data=False, details=False)
      4 with open('SVC.java', 'w') as f:
      5     f.writelines(output)

~/.conda/envs/ml/lib/python3.6/site-packages/sklearn_porter/Porter.py in export(self, class_name, method_name, num_format, details, **kwargs)
    187 
    188         output = self.template.export(class_name=class_name,
--> 189                                       method_name=method_name, **kwargs)
    190         if not details:
    191             return output

~/.conda/envs/ml/lib/python3.6/site-packages/sklearn_porter/estimator/classifier/SVC/__init__.py in export(self, class_name, method_name, export_data, export_dir, export_filename, export_append_checksum, **kwargs)
    131         self.params = params
    132 
--> 133         self.n_features = len(est.support_vectors_[0])
    134         self.svs_rows = est.n_support_
    135         self.n_svs_rows = len(est.n_support_)

~/.conda/envs/ml/lib/python3.6/site-packages/scipy/sparse/base.py in __len__(self)
    264     # non-zeros is more important.  For now, raise an exception!
    265     def __len__(self):
--> 266         raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    267                         " or shape[0]")
    268 

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
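The traceback points at len(est.support_vectors_[0]), which cannot work on a SciPy sparse matrix. A possible fix, sketched here with a small stand-in matrix instead of a fitted SVC, is to read the feature count from shape[1], which exists on both dense arrays and sparse matrices.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.0, 0.0],
                  [2.0, 0.0, 3.0]])
support_vectors = sparse.csr_matrix(dense)  # stand-in for est.support_vectors_

try:
    len(support_vectors[0])  # the call that fails in the traceback
    len_failed = False
except TypeError:
    len_failed = True

n_features = support_vectors.shape[1]  # works for sparse input
n_features_dense = dense.shape[1]      # and for dense input

print(len_failed, n_features, n_features_dense)
```

The support vectors themselves would still need converting (for example with toarray()) before being written into source code; that step is not shown.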

Decision tree classifier porter C code predicting index of classes not actual class

I am attaching a training data CSV file whose first column is the target class to predict.
I generated a pickle file, converted it to C code with the sklearn-porter command line, and ran it.
The C code returns the index of the class, not the actual class, whereas the Python predict() function returns the actual class, not the index.

The training CSV file and the pickle file are attached:
csv_and_pickle_file.zip
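If the generated C code returns the position of the winning class, the original label can be recovered by indexing the fitted estimator's classes_ attribute, which scikit-learn keeps in sorted order. A minimal sketch, with made-up labels standing in for clf.classes_ and a hypothetical helper name:

```python
# Stand-in for clf.classes_ of a fitted classifier (sorted unique labels).
classes = [10, 20, 30]

def index_to_label(class_index, class_labels):
    """Map the class index returned by ported code to the actual label."""
    return class_labels[class_index]

ported_output = 2  # hypothetical value printed by the C program
label = index_to_label(ported_output, classes)

print(label)
```

The same lookup table could be embedded in the caller of the C code so that both sides report identical labels.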

SVC gamma value of 'auto' crashes when export_data=True

For SVCs, the default value of gamma is 'auto', which causes porter to crash when export_data is True.

>>> output = porter.export(export_data=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/Porter.py", line 189, in export
    method_name=method_name, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/estimator/classifier/SVC/__init__.py", line 192, in export
    export_append_checksum)
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/estimator/classifier/SVC/__init__.py", line 239, in export_data
    'gamma': float(self.gamma),
ValueError: could not convert string to float: auto

The solution is to check if it's equal to 'auto'; if so, then set gamma equal to 1/n_features, such as:

>>> clf.gamma
'auto'
>>> clf.gamma = 1/float(clf.support_vectors_.shape[1])
>>> output = porter.export(export_data=True) # success
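The workaround above could be folded into a small helper that runs before the export. This is only a sketch; resolve_gamma is a hypothetical name and not part of sklearn-porter.

```python
def resolve_gamma(gamma, n_features):
    """Return gamma as a float, mapping 'auto' to 1 / n_features."""
    if gamma == 'auto':
        return 1.0 / n_features
    return float(gamma)

# Hypothetical usage with 4 features, as in the iris data set.
print(resolve_gamma('auto', 4))
print(resolve_gamma(0.001, 4))
```

With a fitted estimator, n_features would come from clf.support_vectors_.shape[1], as in the session above.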

requirements.txt file not found

The package installed with no errors using pip3.

I get the following error when importing. Using python 3.7.2 on MacOS.

Thanks

$ python3
Python 3.7.2 (default, Jan 13 2019, 12:51:54)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> from sklearn_porter import Porter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/sklearn_porter/__init__.py", line 42, in <module>
    meta = _load_meta(package)
  File "/usr/local/lib/python3.7/site-packages/sklearn_porter/__init__.py", line 26, in _load_meta
    reqs = open(req_path, 'r').read().strip().split('\n')
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/site-packages/sklearn_porter/../requirements.txt'

Naive Bayes predicting the same label

I am figuring out which machine learning approach is best for a pedometer project. So far I have gyroscope and accelerometer data for walking and not walking. When I train and test a Naive Bayes model on my machine I get nearly 70% accuracy. However, when I port it to Java, add it to my Android app and start using the implementation, it just predicts the same label every time. Several questions arise from this: Why is this happening? Do I need to use an online learning algorithm for this scenario? Is the balance of my classes wrong?

Single feature RandomForestClassifier throws index out of range exception

I've built a very simple single feature RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier
import numpy as np

from sklearn_porter import Porter

rf = RandomForestClassifier()
features = [[i] for i in xrange(0, 10)]
labels = [i > 5 for i in xrange(0, 10)]

rf.fit(features, labels)

for feature in xrange(-20, 20):
	print feature, '->', rf.predict(np.array([feature]).reshape(1, -1))

result = Porter(language='java').port(rf)
print result

which gives the following stack trace:

Traceback (most recent call last):
  File "generateModel.py", line 21, in <module>
    result = Porter(language='java').port(rf)
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/__init__.py", line 72, in port
    ported_model = instance.port(model)
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/classifier/RandomForestClassifier/__init__.py", line 84, in port
    return self.predict()
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/classifier/RandomForestClassifier/__init__.py", line 95, in predict
    return self.create_class(self.create_method())
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/classifier/RandomForestClassifier/__init__.py", line 198, in create_method
    tree = self.create_single_method(idx, model)
  File "/usr/local/lib/python2.7/dist-packages/sklearn_porter/classifier/RandomForestClassifier/__init__.py", line 162, in create_single_method
    indices.append([str(j) for j in range(model.n_features_)][i])
IndexError: list index out of range

The line in question involves indexing into the feature vector, but sometimes the index is negative, which is fine except when it wraps around the list twice. In this case, model.n_features_ is 1 but i (the index) is -2, giving the list out of range exception. What is the best solution for this? Would simply taking the modulus of the index by the length of list be correct?

Thanks!
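
On the modulus idea: in scikit-learn's fitted trees, the per-node `feature` array uses -2 as a sentinel for leaf nodes, so a negative index here most likely marks a node without a split rather than a position that needs wrapping. A minimal sketch, assuming the index list is built from such an array; the helper name `feature_indices` is hypothetical and not part of sklearn-porter:

```python
LEAF = -2  # sentinel scikit-learn stores in tree_.feature for leaf nodes


def feature_indices(node_features, n_features):
    """Map per-node feature ids to index strings; leaves become None."""
    names = [str(j) for j in range(n_features)]
    indices = []
    for i in node_features:
        if i == LEAF:
            indices.append(None)        # leaf node: there is no feature to index
        elif 0 <= i < n_features:
            indices.append(names[i])    # split node: a real feature index
        else:
            raise ValueError("unexpected feature id: %r" % (i,))
    return indices


# A stump on a single feature: one split node followed by two leaves.
print(feature_indices([0, -2, -2], n_features=1))
```

Taking the index modulo the list length would turn every leaf into a reference to a real feature (with one feature, `-2 % 1 == 0`), which hides the error instead of fixing it.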

C RandomForestClassifier code generation applies num_format lambda to feature indices

Example: I would like to use float constants.

porter = Porter(clf, language='c')
output = porter.export(embed_data=True, num_format=lambda o: str(o) + 'f' )

C output:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int predict_0(float features[]) {
    int classes[3];

    if (features[3f] <= 0.800000011920929f) {
        classes[0] = 49;
        classes[1] = 0;
        classes[2] = 0;
    } else {
        if (features[3f] <= 1.75f) {
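
The output above suggests the formatter is applied to every number in the condition, including the array index. A minimal sketch of the intended behaviour, in which only the threshold passes through `num_format` and the index is always rendered as a plain integer; `render_condition` is a hypothetical helper, not the function sklearn-porter uses:

```python
def render_condition(feature_index, threshold, num_format=repr):
    """Render one C comparison; only the threshold goes through num_format."""
    # The index is an array subscript, so it must stay an integer literal.
    return "features[{}] <= {}".format(int(feature_index), num_format(threshold))


as_float = lambda o: str(o) + 'f'   # same formatter as in the report above
print(render_condition(3, 1.75, num_format=as_float))   # features[3] <= 1.75f
```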

Divergence Between Original Model and Ported Model

I've noticed that the integrity score of the JavaScript ported model for ExtraTrees is around 0.86 (after sampling a few thousand random inputs). What could cause such a large divergence? The ExtraTrees model has 3 estimators with depth 2.

Using a nuSVC Model

Hi,
I know sklearn-porter doesn't support NuSVC models, but they are mathematically equivalent to svm.SVC models (see http://scikit-learn.org/stable/modules/svm.html#nusvc).
I was wondering if there is a workaround for this?
Thank you

“Code too large” in Android Studio when using exported sklearn model

Currently I am trying to export a model from sklearn to Android. For this I use the library sklearn-porter.

I have asked the same question on Stack Overflow.

This generates a Java class from the trained model, which looks like the following:


class DecisionTreeClassifier {

   public static int predict(double[] features) {
        int[] classes = new int[2];

        if (features[350] <= 0.5156863033771515) {
            if (features[568] <= 0.0019607844296842813) {
                if (features[430] <= 0.0019607844296842813) {
                    if (features[405] <= 0.009803921915590763) {
...
}

This file has a size of about 1 MB and thus the error "Code too large" occurs in Android Studio.

Is there a solution for this problem?
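
One workaround, apart from training a shallower tree, follows from where the limit comes from: the JVM caps the bytecode of a single method at 64 KB, and a deeply nested `predict` exceeds it. Keeping the tree as data and walking it in a loop avoids that. A minimal sketch of the idea in Python, assuming the flat node layout scikit-learn exposes on `tree_` (`children_left`, `children_right`, `feature`, `threshold`, `value`); the tiny tree below is made up for illustration:

```python
def predict(features, left, right, feature, threshold, value):
    """Walk a tree stored as parallel arrays and return the majority class."""
    node = 0
    while left[node] != -1:                  # -1 marks a leaf in this layout
        if features[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    counts = value[node]                     # per-class sample counts at the leaf
    return max(range(len(counts)), key=counts.__getitem__)


# Made-up tree: node 0 splits on feature 1, node 2 splits on feature 0,
# nodes 1, 3 and 4 are leaves.
left = [1, -1, 3, -1, -1]
right = [2, -1, 4, -1, -1]
feature = [1, -2, 0, -2, -2]
threshold = [0.5, -2.0, 0.25, -2.0, -2.0]
value = [[5, 5], [4, 0], [1, 5], [1, 0], [0, 5]]

print(predict([0.9, 0.1], left, right, feature, threshold, value))   # 0
print(predict([0.9, 0.7], left, right, feature, threshold, value))   # 1
```

The same loop translates directly to Java, where the arrays can be loaded from a resource file so that no single method grows with the size of the tree.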

Error when installing

The command: pip install sklearn-porter produces the following:

Collecting sklearn-porter
  Using cached sklearn-porter-0.3.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-2k6e9qlh/sklearn-porter/setup.py", line 6, in <module>
        from sklearn_porter import Porter
      File "/tmp/pip-build-2k6e9qlh/sklearn-porter/sklearn_porter/__init__.py", line 3, in <module>
        from Porter import Porter
    ImportError: No module named 'Porter'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-2k6e9qlh/sklearn-porter/

Online learning algorithms?

Which algorithms can be used for streams of data (e.g. passive aggressive, perceptron)?

SVC (kernel=linear) JS prediction logic

Hi @nok, the libsvm implementation seems to use subtraction where sklearn-porter's JavaScript predict method uses addition. I'm guessing both are equivalent if the intercepts have opposite signs, but I'm not sure. Could you please shed some light on this?
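
A small numeric check of that guess, assuming scikit-learn stores the fitted `intercept_` as the negated libsvm `rho`. The coefficients and kernel values below are made up; only the identity between the two forms matters:

```python
def decision_libsvm(dual_coefs, kernel_values, rho):
    """libsvm-style decision value: weighted kernel sum minus rho."""
    return sum(c * k for c, k in zip(dual_coefs, kernel_values)) - rho


def decision_with_intercept(dual_coefs, kernel_values, intercept):
    """Same decision value written with an added intercept."""
    return sum(c * k for c, k in zip(dual_coefs, kernel_values)) + intercept


coefs = [0.5, -1.0, 0.25]      # made-up dual coefficients
kernels = [2.0, 1.0, 4.0]      # made-up kernel evaluations K(sv_i, x)
rho = 0.75
intercept = -rho               # assumption: intercept is the negated rho

print(decision_libsvm(coefs, kernels, rho))
print(decision_with_intercept(coefs, kernels, intercept))
```

Under that assumption both functions return the same value, so subtraction of `rho` and addition of the intercept describe the same decision function.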

How to port a classifier?

Hi,

This is a great tool, and I've been looking for something like it for a while.

Whenever a classifier such as GradientBoostingClassifier is needed, we have to write to you with a feature request. Actually, I don't mind implementing one and contributing it back to this repo. I checked the documentation and did not find any guide on how to implement a port for a new classifier.

Such a guide would be extremely helpful for this project and for other users like me.

Thanks.

C Code generated is not correct.

The variable features is accessed inside the function predict, but it is only in scope within main (the parameter of predict is named atts). It should be either a global variable or passed as a function parameter.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int predict(float atts[2]) {

    int classes[2];
        
    if (features[0] <= 5.43762493134) {
        if (features[1] <= 5.74491977692) {
            if (features[0] <= 3.51197504997) {
                classes[0] = 10; 
                classes[1] = 0; 
            } else {
                classes[0] = 0; 
                classes[1] = 1; 
            }
        } else {
            if (features[1] <= 16.6829204559) {
                if (features[0] <= 2.67515516281) {
                    if (features[1] <= 11.5629148483) {
                        if (features[1] <= 7.29798984528) {
                            if (features[1] <= 6.13995504379) {
                                classes[0] = 1; 
                                classes[1] = 0; 
                            } else {
                                classes[0] = 0; 
                                classes[1] = 3; 
                            }
                        } else {
                            if (features[0] <= 1.60292005539) {
                                if (features[1] <= 8.0366601944) {
                                    classes[0] = 3; 
                                    classes[1] = 0; 
                                } else {
                                    if (features[1] <= 9.11940002441) {
                                        classes[0] = 0; 
                                        classes[1] = 2; 
                                    } else {
                                        if (features[0] <= 1.21078002453) {
                                            if (features[0] <= 1.11364006996) {
                                                classes[0] = 1; 
                                                classes[1] = 0; 
                                            } else {
                                                classes[0] = 0; 
                                                classes[1] = 1; 
                                            }
                                        } else {
                                            classes[0] = 2; 
                                            classes[1] = 0; 
                                        }
                                    }
                                }
                            } else {
                                classes[0] = 6; 
                                classes[1] = 0; 
                            }
                        }
                    } else {
                        if (features[0] <= 2.35693502426) {
                            classes[0] = 0; 
                            classes[1] = 7; 
                        } else {
                            classes[0] = 1; 
                            classes[1] = 0; 
                        }
                    }
                } else {
                    if (features[1] <= 16.5127105713) {
                        if (features[1] <= 12.1385450363) {
                            if (features[1] <= 6.92804527283) {
                                if (features[1] <= 6.25199985504) {
                                    classes[0] = 0; 
                                    classes[1] = 4; 
                                } else {
                                    if (features[0] <= 5.02503490448) {
                                        classes[0] = 2; 
                                        classes[1] = 0; 
                                    } else {
                                        classes[0] = 0; 
                                        classes[1] = 1; 
                                    }
                                }
                            } else {
                                if (features[1] <= 10.6784753799) {
                                    classes[0] = 0; 
                                    classes[1] = 9; 
                                } else {
                                    if (features[1] <= 10.7935905457) {
                                        classes[0] = 1; 
                                        classes[1] = 0; 
                                    } else {
                                        classes[0] = 0; 
                                        classes[1] = 5; 
                                    }
                                }
                            }
                        } else {
                            if (features[0] <= 4.75841522217) {
                                if (features[0] <= 3.42268514633) {
                                    classes[0] = 1; 
                                    classes[1] = 0; 
                                } else {
                                    classes[0] = 0; 
                                    classes[1] = 5; 
                                }
                            } else {
                                classes[0] = 2; 
                                classes[1] = 0; 
                            }
                        }
                    } else {
                        classes[0] = 1; 
                        classes[1] = 0; 
                    }
                }
            } else {
                if (features[0] <= 4.17648506165) {
                    classes[0] = 6; 
                    classes[1] = 0; 
                } else {
                    if (features[0] <= 4.91468000412) {
                        classes[0] = 0; 
                        classes[1] = 3; 
                    } else {
                        classes[0] = 2; 
                        classes[1] = 0; 
                    }
                }
            }
        }
    } else {
        if (features[0] <= 7.70522975922) {
            if (features[0] <= 7.64461517334) {
                if (features[0] <= 6.52222013474) {
                    if (features[0] <= 6.49937534332) {
                        if (features[1] <= 8.1920003891) {
                            if (features[1] <= 8.07668018341) {
                                classes[0] = 0; 
                                classes[1] = 4; 
                            } else {
                                classes[0] = 1; 
                                classes[1] = 0; 
                            }
                        } else {
                            classes[0] = 0; 
                            classes[1] = 14; 
                        }
                    } else {
                        classes[0] = 1; 
                        classes[1] = 0; 
                    }
                } else {
                    if (features[1] <= 13.1301851273) {
                        classes[0] = 0; 
                        classes[1] = 41; 
                    } else {
                        if (features[1] <= 13.5656652451) {
                            classes[0] = 1; 
                            classes[1] = 0; 
                        } else {
                            classes[0] = 0; 
                            classes[1] = 7; 
                        }
                    }
                }
            } else {
                classes[0] = 1; 
                classes[1] = 0; 
            }
        } else {
            classes[0] = 0; 
            classes[1] = 183; 
        }
    }

    int index = 0;
    for (int i = 0; i < 2; i++) {
        index = classes[i] > classes[index] ? i : index;
    }
    return index;
}

int main(int argc, const char * argv[]) {

    /* Features: */
    double features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {
        features[i-1] = atof(argv[i]);
    }

    /* Prediction: */
    printf("%d", predict(features));
    return 0;

}

MLPClassifier does not reset network value producing wrong predictions when doing continuous prediction

Hi, great work with Porter, really helpful!

This is a small issue, but it took me a good while to debug, so I wanted to post both the problem and a possible solution that seems to work for me.

I have been porting an MLPClassifier to Android. In desktop Java tests the classifier worked fine, but on Android it would usually produce values that were not completely wrong but slightly off. After more tests I found that the current Java implementation of MLPClassifier stores the network's input values in the object every time a prediction is made. This means that once .predict has been run, any subsequent call reuses values that were changed inside the network: not the weights, but the actual input values and any subsequent estimations. The results are not very different, only slightly off, which makes this very hard to debug; initially I thought it was just a rounding issue. The suggested terminal test on the desktop passes a single input, so the problem is impossible to catch that way: it only appears when .predict is called multiple times in sequence.

One way to fix this issue is to add a method that resets the network values to zero.

public void reset() {
    // Cleans up the network values
    for (int i = 0; i < this.network.length; i++) {
        for (int i2 = 0; i2 < this.network[i].length; i2++) {
            this.network[i][i2] = 0;
        }
    }
}

The solution above has the caveat that it also sets the input values passed to .predict to zero, since predict does not copy the values but keeps a reference to the caller's array.

Creating a new MLPClassifier object or a new this.network for each prediction is another option, but it may be much slower.

Hope this helps other people and if you have a better solution please let me know.
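
The effect can be reproduced with a toy model of the behaviour described above. This is a sketch under assumptions, not the generated Java code: `TinyMLP` and its `clear_buffers` flag are hypothetical, and the weights are made up. It copies the input instead of keeping a reference to it and zeroes each activation before accumulating, which also avoids the caveat of `reset()` overwriting the caller's array:

```python
class TinyMLP:
    """Toy dense network that keeps per-layer activation buffers (hypothetical)."""

    def __init__(self, weights, biases):
        self.weights = weights   # weights[l][i][j]: unit i of layer l -> unit j of layer l+1
        self.biases = biases     # biases[l][j]: bias of unit j in layer l+1
        sizes = [len(weights[0])] + [len(b) for b in biases]
        self.network = [[0.0] * n for n in sizes]

    def predict(self, x, clear_buffers=True):
        self.network[0] = list(x)             # copy: the caller's array is never modified
        last = len(self.weights) - 1
        for l, (w, b) in enumerate(zip(self.weights, self.biases)):
            src, dst = self.network[l], self.network[l + 1]
            for j in range(len(dst)):
                if clear_buffers:
                    dst[j] = 0.0              # discard the value left by the previous call
                for i in range(len(src)):
                    dst[j] += src[i] * w[i][j]
                dst[j] += b[j]
                if l < last:
                    dst[j] = max(0.0, dst[j])  # ReLU on hidden layers only
        return list(self.network[-1])


weights = [[[1.0, -1.0]], [[1.0], [1.0]]]     # 1 input -> 2 hidden -> 1 output
biases = [[0.0, 0.0], [0.0]]

clean = TinyMLP(weights, biases)
print(clean.predict([2.0]), clean.predict([2.0]))       # identical outputs

stale = TinyMLP(weights, biases)
print(stale.predict([2.0], clear_buffers=False),
      stale.predict([2.0], clear_buffers=False))        # second output drifts
```

A single call gives the same result either way, which matches the observation that a one-shot terminal test cannot catch the problem.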

Prediction for ExtraTree model differs from sklearn (tested for C model)

I was trying to implement the predict_proba function for an Extra Tree model when I realized that the result returned by the transpiled version of the model differed from the one returned by sklearn.

My model contains 30 trees and 3 classes. Below are the classes predicted by sklearn alongside the probabilities for each estimator:

	Proba Class 0	Proba Class 1	Proba Class 2	Predicted class
Estimator 0	0.1765	0.0000	0.8235	2
Estimator 1	0.0000	0.0000	1.0000	2
Estimator 2	0.1667	0.0000	0.8333	2
Estimator 3	0.6923	0.0000	0.3077	0
Estimator 4	0.8125	0.0417	0.1458	0
Estimator 5	0.8374	0.0064	0.1562	0
Estimator 6	0.9727	0.0000	0.0273	0
Estimator 7	0.3429	0.0000	0.6571	2
Estimator 8	0.8391	0.0095	0.1514	0
Estimator 9	0.0000	0.0000	1.0000	2
Estimator 10	0.7266	0.0078	0.2656	0
Estimator 11	0.6220	0.0000	0.3780	0
Estimator 12	0.5000	0.0000	0.5000	0
Estimator 13	0.6117	0.0000	0.3883	0
Estimator 14	0.0000	0.0000	1.0000	2
Estimator 15	0.8687	0.0000	0.1313	0
Estimator 16	1.0000	0.0000	0.0000	0
Estimator 17	0.8468	0.0170	0.1362	0
Estimator 18	0.5595	0.0000	0.4405	0
Estimator 19	0.0714	0.0000	0.9286	2
Estimator 20	0.4600	0.0000	0.5400	2
Estimator 21	0.0000	0.0000	1.0000	2
Estimator 22	0.5217	0.0000	0.4783	0
Estimator 23	0.8322	0.0049	0.1629	0
Estimator 24	0.5000	0.0000	0.5000	0
Estimator 25	0.3333	0.0000	0.6667	2
Estimator 26	1.0000	0.0000	0.0000	0
Estimator 27	0.4545	0.0000	0.5455	2
Estimator 28	0.0000	0.0000	1.0000	2
Estimator 29	0.0000	0.0000	1.0000	2
MODEL	0.4916	0.0029	0.5055	2

17 estimators predict class 0 and 13 predict class 2 BUT the model predicts class 2 because it is the most probable class.

Therefore it seems to me that the transpiled model should also base its decision on the predicted probabilities.

What do you think?
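
The two decision rules can be compared directly on the numbers from the table. scikit-learn's forest classifiers average the per-tree class probabilities and take the argmax of the mean. The sketch below contrasts that with a per-tree majority vote; the function names are illustrative only and do not come from sklearn-porter:

```python
# Per-estimator class probabilities, copied from the table above.
probas = [
    (0.1765, 0.0000, 0.8235), (0.0000, 0.0000, 1.0000), (0.1667, 0.0000, 0.8333),
    (0.6923, 0.0000, 0.3077), (0.8125, 0.0417, 0.1458), (0.8374, 0.0064, 0.1562),
    (0.9727, 0.0000, 0.0273), (0.3429, 0.0000, 0.6571), (0.8391, 0.0095, 0.1514),
    (0.0000, 0.0000, 1.0000), (0.7266, 0.0078, 0.2656), (0.6220, 0.0000, 0.3780),
    (0.5000, 0.0000, 0.5000), (0.6117, 0.0000, 0.3883), (0.0000, 0.0000, 1.0000),
    (0.8687, 0.0000, 0.1313), (1.0000, 0.0000, 0.0000), (0.8468, 0.0170, 0.1362),
    (0.5595, 0.0000, 0.4405), (0.0714, 0.0000, 0.9286), (0.4600, 0.0000, 0.5400),
    (0.0000, 0.0000, 1.0000), (0.5217, 0.0000, 0.4783), (0.8322, 0.0049, 0.1629),
    (0.5000, 0.0000, 0.5000), (0.3333, 0.0000, 0.6667), (1.0000, 0.0000, 0.0000),
    (0.4545, 0.0000, 0.5455), (0.0000, 0.0000, 1.0000), (0.0000, 0.0000, 1.0000),
]


def argmax(values):
    """Index of the largest value; the first index wins ties."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(rows):
    """Each estimator casts one vote for its own most probable class."""
    votes = [0] * len(rows[0])
    for row in rows:
        votes[argmax(row)] += 1
    return argmax(votes), votes


def soft_vote(rows):
    """Average the probabilities over estimators, then pick the best class."""
    n = len(rows)
    means = [sum(row[c] for row in rows) / n for c in range(len(rows[0]))]
    return argmax(means), means


print(hard_vote(probas))   # class 0, votes [17, 0, 13]
print(soft_vote(probas))   # class 2, matching the MODEL row
```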

Export:

After executing "pip install sklearn-porter" on Win7 + Python 3.5 (I install 7.0+ by default), an error is reported:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position ...: illegal multibyte sequence
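
Errors of this kind typically appear when a text file is opened without an explicit encoding, so that Python falls back to the platform default (here apparently gbk). A minimal sketch of the usual remedy, assuming that is the cause in this case, which the report does not confirm; the file name and `read_text` helper are made up:

```python
import os
import tempfile


def read_text(path):
    """Read a text file as UTF-8 regardless of the platform's default codec."""
    with open(path, 'r', encoding='utf-8') as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'readme.md')      # made-up file name
    with open(path, 'wb') as handle:
        # The en dash encodes to the bytes E2 80 93 in UTF-8.
        handle.write('scikit\u2013learn'.encode('utf-8'))
    print(read_text(path) == 'scikit\u2013learn')
```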