abelriboulot / onnxt5
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
License: Apache License 2.0
In the current benchmark results, ONNX is slower than PyTorch above 500 words. I think the cause is the OnnxRuntime API used for inference:
Line 121 in 2844749
For GPU inference, that API needs extra memory copies (from CPU to GPU for input tensors, and from GPU to CPU for output tensors). When the sequence length is large, this IO latency can be significant.
I suggest trying OnnxRuntime IO Binding to avoid the extra memory copies.
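A minimal sketch of what IO binding could look like here, assuming a CUDA build of onnxruntime; the model path and the tensor names ("input_ids", "hidden_states") are placeholders for illustration, not the actual names used by onnxt5:

import numpy as np
import onnxruntime as ort

# Session on GPU; input/output names below are assumptions.
sess = ort.InferenceSession("t5-encoder.onnx", providers=["CUDAExecutionProvider"])

# Put the input on the GPU once, so run() does not copy it on every call.
input_ids = np.array([[37, 423, 5]], dtype=np.int64)
ort_ids = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input("input_ids", ort_ids)
binding.bind_output("hidden_states", "cuda")  # keep the output on the GPU too

sess.run_with_iobinding(binding)
hidden = binding.copy_outputs_to_cpu()[0]  # copy back only when actually needed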
Hello, I can't run the first example:
from onnxt5 import GenerativeT5
from onnxt5.api import get_encoder_decoder_tokenizer
decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
prompt = 'translate English to French: I was a victim of a series of accidents.'
output_text, output_logits = generative_t5(prompt, max_length=100, temperature=0.)
# output_text: "J'ai été victime d'une série d'accidents."
The model starts the computation, but before it finishes I get this error:
TypeError Traceback (most recent call last)
<ipython-input-1-257f12b63043> in <module>
5 prompt = 'translate English to French: I was a victim of a series of accidents.'
6
----> 7 output_text, output_logits = generative_t5(prompt, max_length=16, temperature=0.)
8 # output_text: "J'ai été victime d'une série d'accidents."
~\Anaconda3\envs\onnxt5\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
~\Anaconda3\envs\onnxt5\lib\site-packages\onnxt5\models.py in forward(self, prompt, max_length, temperature, repetition_penalty, top_k, top_p, max_context_length)
145 new_tokens.append(next_token)
146
--> 147 return self.tokenizer.decode(new_tokens), new_logits
~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
3000 skip_special_tokens=skip_special_tokens,
3001 clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-> 3002 **kwargs,
3003 )
3004
~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens)
730 spaces_between_special_tokens: bool = True,
731 ) -> str:
--> 732 filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
733
734 # To avoid mixing byte-level and unicode for byte-level BPT
~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils.py in convert_ids_to_tokens(self, ids, skip_special_tokens)
708 tokens = []
709 for index in ids:
--> 710 index = int(index)
711 if skip_special_tokens and index in self.all_special_ids:
712 continue
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
I have no idea how to solve this. If you have any solution, please share. Thanks!
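One hedged workaround, assuming the failure is that each next_token reaches the tokenizer as a one-element list rather than a plain int: flatten the generated ids before decoding. The helper below is hypothetical and not part of onnxt5.

def flatten_token_ids(new_tokens):
    # Coerce each generated id to a plain int so tokenizer.decode()
    # receives List[int] instead of List[list].
    flat = []
    for t in new_tokens:
        while isinstance(t, (list, tuple)):
            t = t[0]
        flat.append(int(t))
    return flat

# e.g. output_text = tokenizer.decode(flatten_token_ids(new_tokens))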
<extra_id_0> the company<extra_id_1> the company<extra_id_2>.<extra_id_3>.<extra_id_4>.<extra_id_5>.<extra_id_6>. <extra_id_7>.
Do I need some post-processing, or is this a bug?
At the moment the context expands indefinitely whereas the self-attention doesn't.
86%|████████▌ | 18/21 [00:00<00:00, 44.29it/s]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-f543e3365977> in <module>()
27 # Generating text
28 generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
---> 29 generative_t5('translate English to French: I was a victim of a series of accidents.', 21, temperature=0.)[0]
3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
505 if isinstance(token_ids, int):
506 token_ids = [token_ids]
--> 507 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
508
509 if clean_up_tokenization_spaces:
TypeError: 'float' object cannot be interpreted as an integer
Are there any version conflicts that you know of?
The relevant function is download_generation_model in api.py
Hi there,
For text translation tasks, can the ONNX model run inference on CPU only?
How much RAM does it require?
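For what it's worth, a minimal sketch of forcing CPU-only sessions with onnxruntime; the .onnx file paths are placeholders for wherever the exported encoder/decoder live:

import onnxruntime as ort
from transformers import T5Tokenizer
from onnxt5 import GenerativeT5

# Omitting CUDAExecutionProvider forces CPU execution.
encoder_sess = ort.InferenceSession("t5-encoder.onnx", providers=["CPUExecutionProvider"])
decoder_sess = ort.InferenceSession("t5-decoder-with-lm-head.onnx", providers=["CPUExecutionProvider"])

tokenizer = T5Tokenizer.from_pretrained("t5-base")
generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)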
@abelriboulot , @Ki6an , @brymck .
I have fine-tuned a T5 model for a paraphrasing task, as described here: Paraphrase with t5.
I want to reduce inference time, so I exported the fine-tuned T5 model using onnxt5. However, the ONNX model on GPU takes more time than the PyTorch model on GPU.
pytorch-gpu:
time taken = 0.2357314471155405
time taken = 0.24958523781970143
time taken = 0.20342689706012607
time taken = 0.5490081580355763
time taken = 0.10756197292357683
onnxt5-gpu:
time taken = 0.5277913622558117
time taken = 0.6335883080027997
time taken = 0.6975196991115808
time taken = 1.9159171842038631
time taken = 0.7938353712670505
Did I make a mistake in exporting or loading the model?
gpu code
onnxt5-gpu code
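One thing worth checking first (an assumption on my part: the ONNX session may be silently falling back to CPU) is which execution providers are actually active:

import onnxruntime as ort

print(ort.get_device())              # "GPU" only if the GPU build is installed
print(encoder_sess.get_providers())  # "CUDAExecutionProvider" should be listed first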
In onnxt5/models.py, lines 118, 124, 125, and 126 contain four useless assignments in total.
Hi @abelriboulot,
How can I calculate the cosine similarity between two sentences using the encoder and decoder embeddings?
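A minimal sketch of one way to do this with the encoder session, assuming its input is named "input_ids" and mean-pooling the last hidden states (both are assumptions, not a confirmed onnxt5 recipe):

import numpy as np
from onnxt5.api import get_encoder_decoder_tokenizer

decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()

def embed(text):
    # Encode to ids, run the ONNX encoder, then mean-pool over tokens.
    ids = tokenizer.encode(text, return_tensors="np")
    hidden = encoder_sess.run(None, {"input_ids": ids})[0]  # (1, seq, dim)
    return hidden.mean(axis=1)[0]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("A man is eating food."), embed("A man is eating a meal.")))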
Recently I have been using the Chinese capability of the multilingual-t5 model for Chinese NLG tasks. However, its inference speed is slow. Could this project be used with multilingual-t5? How can I do that?
Specifically, I am using google/flan-t5-large in a Colab, but the inference time is rather slow for my needs. Can it benefit from onnxt5?
Does this support quantized models by any chance?
How do I suppress the progress output?
Setting the logging verbosity level does nothing:
5%|█████████▊ | 16/300 [00:01<00:18, 15.65it/s]
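A possible workaround, on the assumption that the bar comes from tqdm (onnxt5's generation loop appears to use it): disable tqdm globally before generating.

from functools import partialmethod
from tqdm import tqdm

# Monkey-patch tqdm so every progress bar is created with disable=True.
tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)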
A next step for better generation is to implement beam search. An example can be seen in the huggingface repo here; this would require adding such a function to the GenerativeT5 model in onnxt5/models.py. A rough sketch of the core step follows.
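A hypothetical sketch of one beam-search expansion step (the helper name and the logits_fn interface are mine, not onnxt5's):

import numpy as np

def beam_step(beams, logits_fn, beam_width):
    # beams: list of (token_ids, cumulative_log_prob) pairs.
    # logits_fn: maps token_ids -> log-probabilities over the vocabulary.
    candidates = []
    for tokens, score in beams:
        log_probs = logits_fn(tokens)
        for t in np.argsort(log_probs)[-beam_width:]:
            candidates.append((tokens + [int(t)], score + float(log_probs[t])))
    # Keep only the beam_width best partial sequences.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]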