Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

License: Apache License 2.0


onnxt5's Introduction

ONNX T5

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX.

This package is still in alpha, so some functionality, such as beam search, is still in development.

Installation

ONNX-T5 is available on PyPI.

pip install onnxt5

For the development version, run the following:

git clone https://github.com/abelriboulot/onnxt5
cd onnxt5
pip install -e .

Usage

The simplest way to get started for generation is to use the default pre-trained version of T5 on ONNX included in the package.

NOTE: the first time you call get_encoder_decoder_tokenizer, the models are downloaded, which may take a minute or two.

from onnxt5 import GenerativeT5
from onnxt5.api import get_encoder_decoder_tokenizer
decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
prompt = 'translate English to French: I was a victim of a series of accidents.'

output_text, output_logits = generative_t5(prompt, max_length=100, temperature=0.)
# output_text: "J'ai été victime d'une série d'accidents."

Other tasks only require changing the prefix in your prompt, for instance for summarization:

prompt = 'summarize: <PARAGRAPH>'
output_text, output_logits = generative_t5(prompt, max_length=100, temperature=0.)

If you want to get the embeddings of a text, you can run the following:

from onnxt5.api import get_encoder_decoder_tokenizer, run_embeddings_text

decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
prompt = 'Listen, Billy Pilgrim has come unstuck in time.'
encoder_embeddings, decoder_embeddings = run_embeddings_text(encoder_sess, decoder_sess, tokenizer, prompt)

ONNXT5 also lets you export and use your own models. See the examples/ folder for more detailed examples.
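As a rough illustration of what an export looks like, the encoder of a Hugging Face T5 model can be exported with torch.onnx.export along these lines (a sketch with a hypothetical output path and wrapper, not necessarily the exact helper used in examples/):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

class EncoderWrapper(torch.nn.Module):
    """Wrap the T5 encoder so the exporter sees a module returning a plain tensor."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids):
        return self.encoder(input_ids=input_ids)[0]

model = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')
dummy_input = tokenizer('translate English to French: example', return_tensors='pt').input_ids

torch.onnx.export(
    EncoderWrapper(model.encoder),
    dummy_input,
    't5-encoder.onnx',  # hypothetical output path
    input_names=['input_ids'],
    output_names=['hidden_states'],
    dynamic_axes={'input_ids': {0: 'batch', 1: 'sequence'},
                  'hidden_states': {0: 'batch', 1: 'sequence'}},
    opset_version=12,
)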

T5 works with task prefixes such as summarize:, translate English to German:, or question: ... context:. You can find the list of pretrained tasks and their prefixes in Appendix D of the original paper.

Functionalities

  • Run any of the pretrained T5 tasks in a single line (translation, summarization, sentiment analysis, completion, generation)
  • Export your own T5 models to ONNX easily
  • Utility functions to generate what you need quickly
  • Up to 4X speedup compared to PyTorch execution for smaller contexts

Benchmarks

The speedup varies heavily with the length of the context. For contexts shorter than ~500 words, ONNX greatly outperforms PyTorch, with up to a 4X speedup. The longer the context, the smaller the gain, and PyTorch becomes faster above ~500 words.

GPU Benchmark, Embedding Task

GPU Benchmark, Generation Task
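A rough way to reproduce the comparison for a single prompt (an illustrative sketch, not the exact benchmark script; the prompt and max_length are arbitrary):

import time
from transformers import T5ForConditionalGeneration, T5Tokenizer
from onnxt5 import GenerativeT5
from onnxt5.api import get_encoder_decoder_tokenizer

prompt = 'translate English to French: I was a victim of a series of accidents.'

# ONNX path
decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
onnx_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
start = time.perf_counter()
onnx_t5(prompt, max_length=32, temperature=0.)
print('ONNX   :', time.perf_counter() - start)

# PyTorch path
torch_model = T5ForConditionalGeneration.from_pretrained('t5-base')
torch_tokenizer = T5Tokenizer.from_pretrained('t5-base')
input_ids = torch_tokenizer(prompt, return_tensors='pt').input_ids
start = time.perf_counter()
torch_model.generate(input_ids, max_length=32)
print('PyTorch:', time.perf_counter() - start)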

Contributing

The project is still in its infancy, so I would love your feedback: what problems you are trying to solve, what issues you're encountering, and which features would help you. Feel free to send me an e-mail (see my profile for the address!) or join our Slack community.

Acknowledgements

This repo is based on the work of Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu from Google, as well as the Hugging Face team's implementation of T5, the work of the Microsoft ONNX and onnxruntime teams (in particular Tianlei Wu), and the work of Thomas Wolf on text generation.

Original T5 Paper

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

Microsoft onnxruntime repo

HuggingFace implementation of T5

onnxt5's People

Contributors

brymck, ki6an


onnxt5's Issues

Inference time on gpu vs onnxt5-gpu

@abelriboulot, @Ki6an, @brymck:
I have finetuned a T5 model for a paraphrasing task as described here: Paraphrase with t5.

I want to reduce inference time, so I exported the finetuned T5 model using onnxt5. However, the ONNX model on GPU takes more time than the PyTorch model on GPU.

PyTorch model on GPU:
time taken = 0.2357314471155405
time taken = 0.24958523781970143
time taken = 0.20342689706012607
time taken = 0.5490081580355763
time taken = 0.10756197292357683

ONNX model on GPU (onnxt5):
time taken = 0.5277913622558117
time taken = 0.6335883080027997
time taken = 0.6975196991115808
time taken = 1.9159171842038631
time taken = 0.7938353712670505

Did I make a mistake in exporting or loading the model?
gpu code
onnxt5-gpu code
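As a side note rather than a diagnosis of the numbers above, it can be worth confirming that the ONNX sessions actually run on the CUDA execution provider, since onnxruntime silently falls back to CPU when the GPU build or its CUDA dependencies are missing. A minimal check, assuming encoder_sess and decoder_sess are the loaded sessions:

import onnxruntime as ort

print(ort.get_device())              # 'GPU' only with the onnxruntime-gpu build installed
print(encoder_sess.get_providers())  # 'CUDAExecutionProvider' should come first
print(decoder_sess.get_providers())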

How to suppress output

How to suppress output?
Setting the verbosity logging level does nothing.
5%|█████████▊ | 16/300 [00:01<00:18, 15.65it/s]
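The bar format suggests it comes from tqdm rather than the logging module, so one blunt workaround (an unverified sketch, assuming the progress bar is created via tqdm/trange) is to disable tqdm globally before generating:

from functools import partialmethod
from tqdm import tqdm

# Disable every tqdm progress bar created after this point (including trange).
tqdm.__init__ = partialmethod(tqdm.__init__, disable=True)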

Running example "export_pretrained_model.py" as-is fails (See details)

86%|████████▌ | 18/21 [00:00<00:00, 44.29it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-f543e3365977> in <module>()
     27 # Generating text
     28 generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
---> 29 generative_t5('translate English to French: I was a victim of a series of accidents.', 21, temperature=0.)[0]

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    505         if isinstance(token_ids, int):
    506             token_ids = [token_ids]
--> 507         text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    508 
    509         if clean_up_tokenization_spaces:

TypeError: 'float' object cannot be interpreted as an integer

Any possible version conflicts that you know of?

Use OnnxRuntime IO Binding to improve GPU inference performance

In the current benchmark results, ONNX is slower than PyTorch above 500 words. I think the cause is the OnnxRuntime API used for inference:

encoder_outputs_prompt = self.encoder.run(None, {"input_ids": generated.cpu().numpy()})[0]

For GPU inference, that API needs extra memory copies (from CPU to GPU for input tensors, and from GPU to CPU for output tensors). When the sequence length is large, the IO latency can be significant.

I suggest trying OnnxRuntime IO Binding to avoid the extra memory copies.
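For reference, an IO binding call on the encoder session might look roughly like this (a sketch assuming the onnxruntime-gpu build, that encoder_sess is the encoder InferenceSession, and that the exported graph names its output 'hidden_states'; the actual output name may differ):

import numpy as np
import onnxruntime as ort

# Example token ids; in practice these come from the tokenizer.
input_ids = np.array([[37, 423, 215, 1]], dtype=np.int64)

binding = encoder_sess.io_binding()

# Keep the input on the GPU so run() does not copy it from host memory each step.
input_ortvalue = ort.OrtValue.ortvalue_from_numpy(input_ids, 'cuda', 0)
binding.bind_ortvalue_input('input_ids', input_ortvalue)

# Allocate the output on the GPU as well and fetch it only when needed.
binding.bind_output('hidden_states', 'cuda')

encoder_sess.run_with_iobinding(binding)
encoder_outputs = binding.copy_outputs_to_cpu()[0]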

cpu only inferencing

Hi there,
For text translation tasks, can the ONNX model be run using the CPU only? How much RAM does it require?
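In principle this only requires creating the onnxruntime sessions with the CPU execution provider; a minimal sketch (the .onnx paths are placeholders for wherever the exported models live, and RAM usage depends on the model size):

import onnxruntime as ort
from transformers import T5Tokenizer
from onnxt5 import GenerativeT5

# Placeholder paths: point these at the exported encoder/decoder ONNX files.
encoder_sess = ort.InferenceSession('t5-encoder.onnx', providers=['CPUExecutionProvider'])
decoder_sess = ort.InferenceSession('t5-decoder-with-lm-head.onnx', providers=['CPUExecutionProvider'])
tokenizer = T5Tokenizer.from_pretrained('t5-base')

model = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
output_text, _ = model('translate English to French: How are you?', max_length=32, temperature=0.)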

Can this be used with Flan-T5?

Specifically, I am using google/flan-t5-large in a Colab, but the inference time is rather slow for my needs. Can it benefit from onnxt5?

int() argument must be a string, when running the example

Hello, I can't run the first example:

from onnxt5 import GenerativeT5
from onnxt5.api import get_encoder_decoder_tokenizer

decoder_sess, encoder_sess, tokenizer = get_encoder_decoder_tokenizer()
generative_t5 = GenerativeT5(encoder_sess, decoder_sess, tokenizer, onnx=True)
prompt = 'translate English to French: I was a victim of a series of accidents.'

output_text, output_logits = generative_t5(prompt, max_length=100, temperature=0.)
 # output_text: "J'ai été victime d'une série d'accidents." 

The model starts computing, but before the end I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-257f12b63043> in <module>
      5 prompt = 'translate English to French: I was a victim of a series of accidents.'
      6 
----> 7 output_text, output_logits = generative_t5(prompt, max_length=16, temperature=0.)
      8 # output_text: "J'ai été victime d'une série d'accidents."

~\Anaconda3\envs\onnxt5\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

~\Anaconda3\envs\onnxt5\lib\site-packages\onnxt5\models.py in forward(self, prompt, max_length, temperature, repetition_penalty, top_k, top_p, max_context_length)
    145                 new_tokens.append(next_token)
    146 
--> 147             return self.tokenizer.decode(new_tokens), new_logits

~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils_base.py in decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3000             skip_special_tokens=skip_special_tokens,
   3001             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
-> 3002             **kwargs,
   3003         )
   3004 

~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils.py in _decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, spaces_between_special_tokens)
    730         spaces_between_special_tokens: bool = True,
    731     ) -> str:
--> 732         filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
    733 
    734         # To avoid mixing byte-level and unicode for byte-level BPT

~\Anaconda3\envs\onnxt5\lib\site-packages\transformers\tokenization_utils.py in convert_ids_to_tokens(self, ids, skip_special_tokens)
    708         tokens = []
    709         for index in ids:
--> 710             index = int(index)
    711             if skip_special_tokens and index in self.all_special_ids:
    712                 continue

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

I have no idea how to solve this. If you have any solution, please share. Thanks!

Implement beam search

A next step for better generation is to implement beam search. An example can be seen in the huggingface repo here; this would require adding such a function to the GenerativeT5 model in onnxt5/models.py.
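For illustration, the core bookkeeping of a basic beam search over per-step log-probabilities could look roughly like the sketch below; step(tokens) is a hypothetical stand-in for a decoder forward pass returning next-token log-probabilities, not the actual GenerativeT5 integration:

import numpy as np

def beam_search(step, bos_token_id, eos_token_id, num_beams=4, max_length=20):
    """Generic beam search; step(tokens) returns log-probabilities over the vocabulary."""
    beams = [([bos_token_id], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_token_id:
                candidates.append((tokens, score))  # finished beams carry over unchanged
                continue
            log_probs = step(tokens)
            for token_id in np.argsort(log_probs)[-num_beams:]:
                candidates.append((tokens + [int(token_id)], score + float(log_probs[token_id])))
        # Keep only the num_beams highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == eos_token_id for tokens, _ in beams):
            break
    return beams[0][0]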
