
Comments (6)

justinchuby commented on June 15, 2024

cc @xadupre @gramalingam


xadupre commented on June 15, 2024

mlprodict is archived because its main functionality, a python runtime for onnx called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. This issue is not mentioned in the onnx documentation because the ambiguity comes from scikit-learn, not onnx. This case usually doesn't happen unless the n-grams contain spaces; in that case, the converter gets confused.
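For reference, a minimal sketch of that trick, assuming the behaviour exercised in test_sklearn_text.py: the Traceable* wrappers in skl2onnx.sklapi keep each n-gram as a tuple of tokens rather than a space-joined string, so an n-gram that itself contains a space stays unambiguous.

import numpy as np
from skl2onnx.sklapi import TraceableCountVectorizer

corpus = np.array(["first document", "second first document"])
vect = TraceableCountVectorizer(ngram_range=(1, 2))
vect.fit(corpus)
# vocabulary_ keys are tuples such as ('first', 'document'),
# never the ambiguous space-joined string "first document"
print(vect.vocabulary_)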


geofforigin8 commented on June 15, 2024

Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I was not having success when using any of the min_df, max_df, or max_features parameters.

It looks like convert_sklearn_text_vectorizer expects stop_words_ to be a set of strings, but when min_df et al. are set it becomes a set of tuples, so running to_onnx on it fails with an error like this:
TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...
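A minimal sketch that reproduces the tuples (the corpus here is made up; any fit where max_features prunes some n-grams should show the same thing):

import numpy as np
from skl2onnx.sklapi import TraceableTfidfVectorizer

corpus = np.array(["this is the first document", "is this the first document"])
vect = TraceableTfidfVectorizer(ngram_range=(1, 2), max_features=4)
vect.fit(corpus)
# the pruned n-grams land in stop_words_ as tuples of tokens, not strings,
# which is what convert_sklearn_text_vectorizer trips over
print(vect.stop_words_)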


geofforigin8 commented on June 15, 2024

I've sort of fixed my problem by setting stop_words_ to an empty set and moving on. However, I'm still seeing some discrepancies, which I tracked down to what looks like single-character tokens not being skipped by ONNX.

To reproduce:

import numpy as np 
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession


update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?",
        "first Z document", # uncomment this and the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
        "",
    ]
)

FEATURES = 12

vect = TraceableTfidfVectorizer(ngram_range=(1,2), max_features=FEATURES)
vect.fit(corpus)

print('vocabulary:', vect.vocabulary_)
print('stop_words:', vect.stop_words_)

A = vect.transform(corpus).todense()

vect.stop_words_ = set() # should be safe right?

B = vect.transform(corpus).todense()
assert_almost_equal(A,B)  # make sure vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)

sess = InferenceSession(model_onnx.SerializeToString(), providers=['CPUExecutionProvider'])

inputs = {"input": corpus.reshape((-1, 1))}

out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)

sklearn gives the same vector for both strings as expected:
vect.transform(["first document", "first Z document"]).todense()

matrix([[0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ]])

And the model from onnx only matches the result for the first string:
sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]

array([[0.        , 0.5182243 , 0.6047494 , 0.6047494 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.65069556, 0.7593387 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]], dtype=float32)


gramalingam commented on June 15, 2024

Hi, I haven't understood the details here. ONNX is a specification, so the issue could be either on the conversion side or a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know which it is?

Xavier might have a better idea (he is off this week).


geofforigin8 commented on June 15, 2024

Nope, I did not know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link though; I thought I had found a related issue reported there, but I don't think that's it.

I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238

It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement that the onnx runtime chokes on the (?u) that sklearn has in its default token_pattern?
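For completeness, a sketch of that workaround applied to the reproduction script above, assuming skl2onnx's usual per-model options mapping (options={id(model): {...}}):

import skl2onnx
from skl2onnx.common.data_types import StringTensorType

# vect is the fitted TraceableTfidfVectorizer from the script above;
# overriding tokenexp makes the ONNX TfidfVectorizer skip single-character
# tokens such as "Z", mirroring scikit-learn's default token_pattern
model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
    options={id(vect): {"tokenexp": r"\b\w\w+\b"}},
)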

