Comments (6)
mlprodict is archived because its main functionality, a Python runtime for ONNX called OnnxInference, was moved into the onnx package as ReferenceEvaluator. I did not write an example about the trick I put in place to avoid ambiguities, but it is maintained in the unit tests: https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_text.py#L11. The issue is not mentioned in the onnx documentation because the ambiguity comes from scikit-learn, not from onnx. It usually does not arise unless the n-grams contain spaces; in that case, the converter gets confused.
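The ambiguity can be illustrated with plain scikit-learn, independent of skl2onnx: n-grams are stored in vocabulary_ as space-joined strings, so a bigram is indistinguishable from a single token that happens to contain a space (possible with a custom tokenizer). A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# With ngram_range=(1, 2), scikit-learn stores every n-gram in
# vocabulary_ as a single space-joined string, so the bigram
# "first document" looks exactly like a unigram containing a space.
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(["first document", "second document"])
print(sorted(vect.vocabulary_))
# ['document', 'first', 'first document', 'second', 'second document']
```

The Traceable* classes in skl2onnx.sklapi avoid this by keying the vocabulary with tuples of tokens instead of joined strings.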
Thank you for the quick reply! Using the test cases as a model, I wrote a sample program, but I had no success with any of the min_df, max_df, or max_features parameters. It looks like convert_sklearn_text_vectorizer expects stop_words_ to be a set of strings, but when min_df et al. are set it becomes a set of tuples, so running to_onnx on the model fails with an error like this:

TypeError: One stop word is not a string ('is', 'this') in stop_words={('is', 'this'), ...
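For context, a hedged sketch of where stop_words_ comes from, shown with the plain scikit-learn class (the assumption here is that the Traceable* variants store tuples of tokens where CountVectorizer stores strings, which would explain the tuples in the error):

```python
from sklearn.feature_extraction.text import CountVectorizer

# When max_features (or min_df/max_df) prunes terms, scikit-learn records
# the pruned terms in stop_words_.  Here they are plain strings; with
# TraceableCountVectorizer they would be tuples such as ('is', 'this').
vect = CountVectorizer(ngram_range=(1, 2), max_features=3)
vect.fit(["this is the first document", "is this the first document"])
print(len(vect.stop_words_) > 0)                          # True: terms were pruned
print(all(isinstance(w, str) for w in vect.stop_words_))  # True for the plain class
```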
I've sort of fixed my problem by setting stop_words_ to an empty set and moving on. However, I'm still seeing some discrepancies and tracked them down to what looks like single-character tokens not being skipped by ONNX.

To reproduce:
import numpy as np
import skl2onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer, TraceableCountVectorizer
from skl2onnx import update_registered_converter
from skl2onnx.shape_calculators.text_vectorizer import (
    calculate_sklearn_text_vectorizer_output_shapes,
)
from skl2onnx.operator_converters.text_vectoriser import convert_sklearn_text_vectorizer
from skl2onnx.operator_converters.tfidf_vectoriser import convert_sklearn_tfidf_vectoriser
from skl2onnx.common.data_types import StringTensorType
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession

update_registered_converter(
    TraceableCountVectorizer,
    "Skl2onnxTraceableCountVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_text_vectorizer,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)
update_registered_converter(
    TraceableTfidfVectorizer,
    "Skl2onnxTraceableTfidfVectorizer",
    calculate_sklearn_text_vectorizer_output_shapes,
    convert_sklearn_tfidf_vectoriser,
    options={
        "tokenexp": None,
        "separators": None,
        "nan": [True, False],
        "keep_empty_string": [True, False],
        "locale": None,
    },
)

corpus = np.array([
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
    "first Z document",  # with this line, the last assertion fails because ONNX still sees the "Z" token when it shouldn't?
    "",
])

FEATURES = 12
vect = TraceableTfidfVectorizer(ngram_range=(1, 2), max_features=FEATURES)
vect.fit(corpus)
print("vocabulary:", vect.vocabulary_)
print("stop_words:", vect.stop_words_)

A = vect.transform(corpus).todense()
vect.stop_words_ = set()  # should be safe, right?
B = vect.transform(corpus).todense()
assert_almost_equal(A, B)  # make sure the vectorizer still does what we expect

model_onnx = skl2onnx.to_onnx(
    vect,
    "TfidfVectorizer",
    initial_types=[("input", StringTensorType([None, 1]))],
    target_opset=12,
)
sess = InferenceSession(model_onnx.SerializeToString(), providers=["CPUExecutionProvider"])
inputs = {"input": corpus.reshape((-1, 1))}
out = sess.run(None, inputs)[0]
assert out.shape == (len(corpus), FEATURES)
assert_almost_equal(out.ravel(), A.A1)
sklearn gives the same vector for both strings, as expected:

vect.transform(["first document", "first Z document"]).todense()
matrix([[0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.51822427, 0.60474937, 0.60474937, 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        ]])
And the model from ONNX only matches the results on the first string:

sess.run(None, {'input': [["first document"], ["first Z document"]]})[0]
array([[0.        , 0.5182243 , 0.6047494 , 0.6047494 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.65069556, 0.7593387 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]], dtype=float32)
Hi, I haven't fully understood the details here. ONNX is a specification; the issue could be either on the conversion side or a bug in the onnxruntime implementation. If it is the latter, I would recommend filing an issue in the onnxruntime repo. Do you know which it is?
Xavier might have a better idea (he is off this week).
Nope, I did not know. I don't have a full picture of all the related projects, their ownership, and the relations between them. Thanks for the link, though; I thought I had found a related issue reported there, but I don't think that's it.
I was able to resolve my issue by explicitly passing "tokenexp": r"\b\w\w+\b" in the options. I don't think this substitution is correct:
https://github.com/onnx/sklearn-onnx/blob/d2029c1a9752f62a63fc5c4447b4d9fe75e8fe39/skl2onnx/operator_converters/text_vectoriser.py#L238
It should be something like [a-zA-Z0-9_]{2,} or \b\w\w+\b. Is the reason for this replacement that the ONNX runtime chokes on the (?u) that sklearn has in its default token_pattern?
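For what it's worth, the two candidate patterns behave the same on ASCII input, and both drop single-character tokens; a substitute pattern without the two-character minimum would explain the stray "Z". A quick check, nothing skl2onnx-specific:

```python
import re

# scikit-learn's default token_pattern requires two or more word
# characters, so single-character tokens such as "Z" are dropped.
print(re.findall(r"(?u)\b\w\w+\b", "first Z document"))     # ['first', 'document']
print(re.findall(r"[a-zA-Z0-9_]{2,}", "first Z document"))  # ['first', 'document']

# A pattern without that minimum keeps the single character:
print(re.findall(r"\w+", "first Z document"))               # ['first', 'Z', 'document']
```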