sonijogesh / intel-extension-for-transformers

This project forked from intel/intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

License: Apache License 2.0

Shell 0.18% JavaScript 0.12% C++ 44.62% Python 35.48% C 3.42% TypeScript 0.46% CSS 0.06% HTML 9.35% CMake 0.64% Jupyter Notebook 2.50% Dockerfile 0.19% Svelte 2.98%

intel-extension-for-transformers's Introduction

Intel® Extension for Transformers

An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere

Release Notes

🏭Architecture   |   💬NeuralChat   |   😃Inference   |   💻Examples   |   📖Documentation

🚀Latest News

  • [2023/10] LLM runtime, an Intel-optimized, GGML-compatible runtime, demonstrates up to a 15x performance gain in first-token generation and 1.5x in subsequent-token generation over the default llama.cpp.
  • [2023/10] LLM runtime now supports LLM inference with infinite-length inputs of up to 4 million tokens, inspired by StreamingLLM.
  • [2023/09] NeuralChat was showcased in the Intel Innovation'23 Keynote and at Google Cloud Next'23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors.
  • [2023/08] NeuralChat supports custom chatbot development and deployment within minutes on a broad range of Intel hardware, such as Xeon Scalable Processors, Gaudi2, Xeon CPU Max Series, Data Center GPU Max Series, Arc Series, and Core Processors. Check out Notebooks.
  • [2023/07] LLM runtime extends Hugging Face Transformers API to provide seamless low precision inference for popular LLMs, supporting low precision data types such as INT3/INT4/FP4/NF4/INT5/INT8/FP8.
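The infinite-length inference item above credits StreamingLLM. A toy sketch of the idea, assuming the usual description of that technique (keep a few initial "attention sink" positions plus a sliding window of recent positions in the KV cache), might look like the following. The function names, `n_sink`, and `window` are hypothetical and are not taken from LLM runtime.

```python
def evict_kv_positions(cached_positions, n_sink=4, window=8):
    """Keep the first n_sink positions plus the most recent `window` positions."""
    if len(cached_positions) <= n_sink + window:
        return list(cached_positions)
    recent_start = len(cached_positions) - window
    return list(cached_positions[:n_sink]) + list(cached_positions[recent_start:])


def stream_tokens(n_tokens, n_sink=4, window=8):
    """Simulate feeding n_tokens one at a time; return the positions still cached."""
    cache = []
    for pos in range(n_tokens):
        cache.append(pos)
        cache = evict_kv_positions(cache, n_sink, window)
    return cache


if __name__ == "__main__":
    # The cache size stays bounded no matter how many tokens are streamed.
    print(len(stream_tokens(100_000)))
```

The point of the sketch is only that the cache stays at a fixed size (`n_sink + window`) while the input keeps growing; a real runtime also has to handle position encoding for the retained entries.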

🏃Installation

Quick Install from PyPI

pip install intel-extension-for-transformers

For more installation methods, please refer to the Installation Page.
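After installing, one way to confirm that the package is visible to the current Python environment is to query its installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is the one used in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("intel-extension-for-transformers")
    print(version if version else "not installed")
```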

🌟Introduction

Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:

🌱Getting Started

Below is the sample code to enable the chatbot. See more examples.

Chatbot

# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot

# Build a chatbot with the default configuration and ask a single question.
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
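To send several prompts through the same chatbot, a small helper like the one below may be convenient. It relies only on the `predict` method shown above; `ask_all` and `EchoChatbot` are hypothetical names, and `EchoChatbot` is a stand-in so the sketch runs without downloading a model.

```python
def ask_all(chatbot, prompts):
    """Call chatbot.predict on each prompt and return (prompt, response) pairs."""
    return [(prompt, chatbot.predict(prompt)) for prompt in prompts]


class EchoChatbot:
    """Hypothetical stand-in that mimics the predict() call used above."""

    def predict(self, query):
        return f"(echo) {query}"


if __name__ == "__main__":
    prompts = ["Tell me about Intel Xeon Scalable Processors."]
    for prompt, response in ask_all(EchoChatbot(), prompts):
        print(prompt, "->", response)
```

With the real package installed, the object returned by `build_chatbot()` would presumably be passed in place of `EchoChatbot()`.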

Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.

INT4 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
outputs = tokenizer.batch_decode(gen_tokens)
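For intuition about what weight-only INT4 quantization does to a weight tensor, here is a toy pure-Python sketch of symmetric per-group quantization. It is an illustration under assumptions, not the scheme used by `WeightOnlyQuantConfig`: the group size of 32, the symmetric range [-8, 7], and the function names are all hypothetical.

```python
def quantize_int4(weights, group_size=32):
    """Quantize floats to integers in [-8, 7], with one scale per group."""
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid dividing by zero.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_int4(q_values, scales, group_size=32):
    """Recover approximate floats by multiplying each value by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    weights = [0.1 * i - 1.0 for i in range(64)]
    q, s = quantize_int4(weights)
    restored = dequantize_int4(q, s)
    print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as a 4-bit integer plus a shared per-group scale, so the reconstruction error per weight is at most half of that group's scale.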

INT8 Inference

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int8")
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
gen_tokens = model.generate(inputs, max_new_tokens=300)
outputs = tokenizer.batch_decode(gen_tokens)
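A rough way to see why the weight data type matters is to estimate the storage needed for the weights alone. The sketch below assumes about 7 billion parameters (inferred from the "7b" in the model name) and ignores quantization scales, activations, and the KV cache; `weight_footprint_gib` is a hypothetical helper.

```python
def weight_footprint_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and other overhead."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # assumption: roughly 7 billion parameters
    for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {weight_footprint_gib(n_params, bits):.1f} GiB")
```

Halving the bits per weight halves this estimate, which is the main reason INT8 and INT4 weights make larger models fit on memory-constrained devices.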

🎯Validated Models

You can find the latest INT4 performance and accuracy data in the INT4 blog.

Additionally, we are preparing to introduce Baichuan, Mistral, and other models into LLM Runtime (Intel-optimized llama.cpp). For comprehensive accuracy and performance data, though not the most up-to-date, please refer to the Release data.

📖Documentation

OVERVIEW
NeuralChat   |   LLM Runtime
NEURALCHAT
Chatbot on Intel CPU   |   Chatbot on Intel GPU   |   Chatbot on Gaudi
Chatbot on Client   |   More Notebooks
LLM RUNTIME
LLM Runtime   |   Streaming LLM   |   Low Precision Kernels   |   Tensor Parallelism
LLM COMPRESSION
SmoothQuant (INT8)   |   Weight-only Quantization (INT4/FP4/NF4/INT8)   |   QLoRA on CPU
GENERAL COMPRESSION
Quantization   |   Pruning   |   Distillation   |   Orchestration
Neural Architecture Search   |   Export   |   Metrics   |   Objectives
Pipeline   |   Length Adaptive   |   Early Exit   |   Data Augmentation
TUTORIALS & RESULTS
Tutorials   |   LLM List   |   General Model List   |   Model Performance

🙌Demo

  • Infinite inference (up to 4M tokens): streamingLLM_v2.mp4

📃Selected Publications/Events

View Full Publication List.

Additional Content

Acknowledgements

💁Collaborations

We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating with you on Intel Extension for Transformers!

intel-extension-for-transformers's People

Contributors

vincyzhang, changwangss, penghuicheng, zhenwei-intel, lvliang-intel, xin3he, airmeng, a32543254, zhenzhong1, zhentaoyu, xinyuye-intel, hshen14, zhewang1-intc, eason9393, intellinjun, yi1ding, spycsh, ddele, xuhuiren, yuchengliu1, n1ck-guo, kevinintel, sywangyi, violetch24, luoyu-intel, lkk12014402, letonghan, tofindoutmagic, jiafuzha, sunjiweiswift
