zhouleidcc / neural-compressor

This project forked from intel/neural-compressor

Intel® Neural Compressor (formerly known as Intel® Low Precision Optimization Tool) aims to provide unified APIs for network compression technologies, such as low precision, sparsity, pruning, and knowledge distillation, across different deep learning frameworks to pursue the best inference performance.

License: Apache License 2.0

Python 87.92% JavaScript 0.02% TypeScript 2.85% HTML 1.53% SCSS 0.64% Makefile 0.02% CSS 0.01% Batchfile 0.02% CMake 0.16% C++ 6.82%

Introduction to Intel® Neural Compressor

Intel® Neural Compressor (formerly known as Intel® Low Precision Optimization Tool) is an open-source Python library running on Intel CPUs and GPUs, which delivers unified interfaces across multiple deep learning frameworks for popular network compression technologies, such as quantization, pruning, and knowledge distillation. It supports automatic accuracy-driven tuning strategies to help users quickly find the best quantized model. It also implements different weight pruning algorithms to generate pruned models with a predefined sparsity goal, and supports knowledge distillation to transfer knowledge from a teacher model to a student model.

Note

GPU support is under development.

Visit the Intel® Neural Compressor online documentation at: https://intel.github.io/neural-compressor.

Architecture

Intel® Neural Compressor features an infrastructure and workflow that help increase performance and speed up deployment across architectures.

Infrastructure

[Figure: Intel® Neural Compressor infrastructure]

Workflow

[Figure: Intel® Neural Compressor workflow]

Supported Frameworks

Supported deep learning frameworks are TensorFlow, PyTorch, MXNet, and ONNX Runtime.

Note: Intel Optimized TensorFlow 2.5.0 requires setting the environment variable TF_ENABLE_MKL_NATIVE_FORMAT=0 before running Neural Compressor quantization or deploying the quantized model.

Note: Starting with official TensorFlow 2.6.0, oneDNN support has been upstreamed. Download the official TensorFlow 2.6.0 binary for the CPU device and set the environment variable TF_ENABLE_ONEDNN_OPTS=1 before running the quantization process or deploying the quantized model.
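As a minimal sketch of the two notes above, the hypothetical helper `configure_tf_env` (not part of Neural Compressor or TensorFlow) sets the documented environment variable from Python, so it can sit at the top of a quantization script before TensorFlow is imported. It assumes that official releases after 2.6.0 follow the same rule as 2.6.0, which the note implies but does not state explicitly.

```python
import os


def configure_tf_env(tf_version: str, intel_optimized: bool) -> dict:
    """Hypothetical helper: set the env vars from the TensorFlow notes above.

    Call it before importing TensorFlow so the settings are in place
    when quantization runs or the quantized model is deployed.
    """
    major, minor = map(int, tf_version.split(".")[:2])
    updates = {}
    if intel_optimized and (major, minor) == (2, 5):
        # Intel Optimized TensorFlow 2.5.0: disable the MKL native format.
        updates["TF_ENABLE_MKL_NATIVE_FORMAT"] = "0"
    elif not intel_optimized and (major, minor) >= (2, 6):
        # Official TensorFlow 2.6.0 (assumed: and later): switch oneDNN on.
        updates["TF_ENABLE_ONEDNN_OPTS"] = "1"
    os.environ.update(updates)
    return updates


print(configure_tf_env("2.6.0", intel_optimized=False))
# prints {'TF_ENABLE_ONEDNN_OPTS': '1'}
# import tensorflow as tf  # import TensorFlow only after this point
```

Exporting the same variable in the shell (`export TF_ENABLE_ONEDNN_OPTS=1`) is equivalent; the Python form just keeps the setting next to the code that depends on it.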

Installation

Select the installation based on your operating system.

Linux Installation

You can install Neural Compressor using one of three options: Install just the library from binary or source, or get the Intel-optimized framework together with the library by installing the Intel® oneAPI AI Analytics Toolkit.

Option 1 Install from binary

# install stable version from pip
pip install neural-compressor

# install nightly version from pip
pip install -i https://test.pypi.org/simple/ neural-compressor

# install stable version from conda
conda install neural-compressor -c conda-forge -c intel 

Option 2 Install from source

git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
python setup.py install

Option 3 Install from AI Kit

The Intel® Neural Compressor library is released as part of the Intel® oneAPI AI Analytics Toolkit (AI Kit). The AI Kit provides a consolidated package of Intel's latest deep learning and machine learning optimizations all in one place for ease of development. Along with Neural Compressor, the AI Kit includes Intel-optimized versions of deep learning frameworks (such as TensorFlow and PyTorch) and high-performing Python libraries to streamline end-to-end data science and AI workflows on Intel architectures.

The AI Kit is distributed through many common channels, including from Intel's website, YUM, APT, Anaconda, and more. Select and download the AI Kit distribution package that's best suited for you and follow the Get Started Guide for post-installation instructions.

Download AI Kit | AI Kit Get Started Guide
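Whichever option you choose, a quick sanity check can confirm that the package is importable. This sketch uses only the standard library (`importlib.util.find_spec` and `importlib.metadata`; the latter requires Python 3.8 or later). The helper name `installed_version` is hypothetical; the distribution name `neural-compressor` comes from the pip command above and the import name `neural_compressor` from the sed note later in this README.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "neural-compressor",
                      module_name: str = "neural_compressor") -> Optional[str]:
    """Return the installed version, "unknown", or None if not importable."""
    # For a top-level name, find_spec locates the module without importing it.
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        # Read the version from the installed distribution's metadata.
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable (e.g. from a source checkout on sys.path) but no metadata.
        return "unknown"


if __name__ == "__main__":
    print(installed_version() or "neural-compressor is not installed")
```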

Windows Installation

Prerequisites

The following prerequisites and requirements must be satisfied for a successful installation:

  • Python version: 3.6, 3.7, 3.8, or 3.9

  • Download and install Anaconda.

  • Create a virtual environment named nc in Anaconda:

    # This example uses Python 3.7; you can also choose Python 3.6, 3.8, or 3.9.
    conda create -n nc python=3.7
    conda activate nc
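To confirm that the activated nc environment uses one of the Python versions listed above, a small standard-library check can be run inside it. `is_supported_python` is a hypothetical helper, and its version set simply mirrors the prerequisite list.

```python
import sys

# Versions listed in the prerequisites above.
SUPPORTED = {(3, 6), (3, 7), (3, 8), (3, 9)}


def is_supported_python(version_info=sys.version_info) -> bool:
    """True if the major.minor version is one of the listed versions."""
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "not in the listed versions"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```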

Installation options

Option 1 Install from binary

# install stable version from pip
pip install neural-compressor

# install nightly version from pip
pip install -i https://test.pypi.org/simple/ neural-compressor

# install from conda
conda install neural-compressor -c conda-forge -c intel 

Option 2 Install from source

git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
python setup.py install

Note: To run examples written for neural-compressor versions earlier than 1.7 with the neural-compressor 1.7 (or later) binary, replace the old package name lpot with neural_compressor in main.py:

sed -i "s|lpot|neural_compressor|g" main.py
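On Windows, `sed` is not available out of the box, so the same replacement can be done in Python. This sketch (hypothetical helper names `rename_package` and `migrate_file`) mirrors the sed command exactly: it replaces every occurrence of `lpot`, not only import statements, so it is worth reviewing the result afterwards.

```python
from pathlib import Path


def rename_package(source: str, old: str = "lpot", new: str = "neural_compressor") -> str:
    """Replace every occurrence of `old`, like s|lpot|neural_compressor|g."""
    return source.replace(old, new)


def migrate_file(path: str) -> int:
    """Rewrite a script in place and return how many replacements were made."""
    script = Path(path)
    text = script.read_text(encoding="utf-8")
    count = text.count("lpot")
    script.write_text(rename_package(text), encoding="utf-8")
    return count
```

Usage: run `migrate_file("main.py")` from the example's directory.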

Documentation

Get Started

  • APIs explains Intel® Neural Compressor's API.
  • Transform introduces how to utilize Neural Compressor's built-in data processing and how to develop a custom data processing method.
  • Dataset introduces how to utilize Neural Compressor's built-in dataset and how to develop a custom dataset.
  • Metric introduces how to utilize Neural Compressor's built-in metrics and how to develop a custom metric.
  • Tutorial provides comprehensive instructions on how to utilize Neural Compressor's features with examples.
  • Examples are provided to demonstrate the usage of Neural Compressor in different frameworks: TensorFlow, PyTorch, MXNet, and ONNX Runtime.
  • Intel® Neural Compressor Bench is a web-based system used to simplify Intel® Neural Compressor usage.
  • Intel oneAPI AI Analytics Toolkit Get Started Guide explains the AI Kit components, installation and configuration guides, and instructions for building and running sample apps.
  • AI and Analytics Samples includes code samples for Intel oneAPI libraries.

Deep Dive

  • Quantization refers to processes that enable inference and training by performing computations in low-precision data types, such as fixed-point integers. Neural Compressor supports Post-Training Quantization (PTQ) with different quantization capabilities and Quantization-Aware Training (QAT). Note that Dynamic Quantization currently has limited support.
  • Pruning provides a common method for introducing sparsity in weights and activations.
  • Knowledge Distillation provides a common method for distilling knowledge from a teacher model to a student model.
  • Distributed Training introduces how to leverage Horovod for multi-node training in Intel® Neural Compressor to reduce training time.
  • Benchmarking introduces how to utilize the benchmark interface of Neural Compressor.
  • Mixed precision introduces how to enable mixed precision, including BF16, INT8, and FP32, on Intel platforms during tuning.
  • Graph Optimization introduces how to enable graph optimization for FP32 and auto-mixed precision.
  • Model Conversion introduces how to convert a TensorFlow QAT model to a quantized model running on Intel platforms.
  • TensorBoard provides tensor histograms and execution graphs for debugging the tuning process.
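To make the quantization bullet concrete, here is a minimal sketch of affine (scale and zero-point) INT8 quantization in plain Python. It illustrates the general idea of mapping FP32 values to fixed-point integers; it is not Neural Compressor's implementation, the helper names are hypothetical, and per-tensor min/max calibration is assumed.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit integer range


def affine_params(x_min: float, x_max: float) -> Tuple[float, int]:
    """Scale and zero point that map [x_min, x_max] onto [QMIN, QMAX]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (QMAX - QMIN) or 1.0  # avoid a zero scale
    zero_point = max(QMIN, min(QMAX, round(QMIN - x_min / scale)))
    return scale, zero_point


def quantize(xs: List[float], scale: float, zero_point: int) -> List[int]:
    # Round to the nearest integer step, shift by the zero point, then clamp.
    return [max(QMIN, min(QMAX, round(x / scale) + zero_point)) for x in xs]


def dequantize(qs: List[int], scale: float, zero_point: int) -> List[float]:
    return [(q - zero_point) * scale for q in qs]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zp = affine_params(min(weights), max(weights))
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # at most about scale / 2
```

The reconstruction error is bounded by roughly half a quantization step, which is why a well-calibrated range matters more than anything else in this scheme.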

Advanced Topics

  • Engine is a new backend supported by Intel® Neural Compressor to provide domain-specific acceleration for NLP models.
  • Adaptor is the interface between Neural Compressor components and the deep learning framework. It explains how to develop an adaptor extension, using ONNX Runtime as an example.
  • Strategy automatically optimizes low-precision recipes for deep learning models to achieve optimal product objectives, such as inference performance and memory usage, within the expected accuracy criteria. It also explains how to develop a new strategy.
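The Strategy bullet describes accuracy-driven tuning. As a rough, generic sketch (not the library's actual strategy code), such a loop tries candidate low-precision recipes in order and stops at the first one whose relative accuracy drop stays within a budget. The 1% relative budget, the recipe names, and the accuracy numbers below are illustrative assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple


def accuracy_driven_tune(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
    fp32_accuracy: float,
    relative_loss: float = 0.01,
    max_trials: int = 100,
) -> Optional[Tuple[str, float]]:
    """Return the first candidate whose relative accuracy drop is within budget."""
    for trial, config in enumerate(candidates):
        if trial >= max_trials:
            break
        accuracy = evaluate(config)  # e.g. quantize with this recipe, then evaluate
        drop = (fp32_accuracy - accuracy) / fp32_accuracy
        if drop <= relative_loss:
            return config, accuracy
    return None  # no recipe met the accuracy criterion


# Toy results standing in for real quantize-and-evaluate runs (hypothetical).
fake_results = {"all_int8": 0.742, "int8_except_first_last": 0.759, "mixed_bf16": 0.762}
best = accuracy_driven_tune(fake_results, fake_results.__getitem__, fp32_accuracy=0.7646)
print(best)
```

Ordering candidates from most to least aggressive means the first accepted recipe is also the fastest one tried, which is the trade-off this kind of strategy is built around.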

Publications

The full list of publications is available here.

System Requirements

Intel® Neural Compressor supports systems based on Intel 64 architecture or compatible processors, and is specifically optimized for the following CPUs:

  • Intel Xeon Scalable processor (formerly Skylake, Cascade Lake, Cooper Lake, and Ice Lake)
  • Future Intel Xeon Scalable processor (code name Sapphire Rapids)

Intel® Neural Compressor requires installing the Intel-optimized framework version for the supported DL framework you use: TensorFlow, PyTorch, MXNet, or ONNX Runtime.

Note: Intel Neural Compressor supports Intel-optimized and official frameworks for some TensorFlow versions. Refer to Supported Frameworks for specifics.

Validated Hardware/Software Environment

Platform: Cascade Lake, Cooper Lake, Skylake, Ice Lake
OS: CentOS 8.3, Ubuntu 18.04
Python: 3.6, 3.7, 3.8, 3.9

Framework      Version
TensorFlow     2.6.0, 2.5.0, 2.4.0, 2.3.0, 2.2.0, 2.1.0, 1.15.0 UP1, 1.15.0 UP2, 1.15.0 UP3, 1.15.2
PyTorch        1.5.0+cpu, 1.8.0+cpu, 1.9.0+cpu, IPEX
MXNet          1.8.0, 1.7.0, 1.6.0
ONNX Runtime   1.6.0, 1.7.0, 1.8.0

Validated Models

Intel® Neural Compressor provides numerous examples that show small accuracy loss with significant performance gains. A full list of quantized models on various frameworks is available in the Model List.

Validated MLPerf Models

Model           Framework   Support  Example
ResNet50 v1.5   TensorFlow  Yes      Link
ResNet50 v1.5   PyTorch     Yes      Link
DLRM            PyTorch     Yes      Link
BERT-large      TensorFlow  Yes      Link
BERT-large      PyTorch     Yes      Link
SSD-ResNet34    TensorFlow  Yes      Link
SSD-ResNet34    PyTorch     Yes      Link
RNN-T           PyTorch     WIP
3D-UNet         TensorFlow  WIP
3D-UNet         PyTorch     Yes      Link

Validated Quantized Models

Realtime latency for both precisions measured on CLX8280, 1s 4c per instance.
Framework  Version  Model  INT8 Tuning Accuracy  FP32 Accuracy Baseline  Acc Ratio [(INT8-FP32)/FP32]  INT8 Realtime (ms)  FP32 Realtime (ms)  Realtime Latency Ratio [FP32/INT8]
tensorflow 2.5.0 resnet50v1.0 74.24% 74.27% -0.04% 7.56 21.24 2.81x
tensorflow 2.5.0 resnet50v1.5 76.94% 76.46% 0.63% 9.64 24.86 2.58x
tensorflow 2.5.0 resnet101 77.21% 76.45% 0.99% 12.73 30.80 2.42x
tensorflow 2.5.0 inception_v1 70.30% 69.74% 0.80% 5.57 9.92 1.78x
tensorflow 2.5.0 inception_v2 74.27% 73.97% 0.41% 6.69 12.33 1.84x
tensorflow 2.5.0 inception_v3 77.29% 76.75% 0.70% 12.90 27.46 2.13x
tensorflow 2.5.0 inception_v4 80.36% 80.27% 0.11% 20.88 54.13 2.59x
tensorflow 2.5.0 inception_resnet_v2 80.42% 80.40% 0.02% 44.47 87.69 1.97x
tensorflow 2.5.0 mobilenetv1 73.93% 70.96% 4.19% 2.95 10.12 3.43x
tensorflow 2.5.0 mobilenetv2 71.96% 71.76% 0.28% 4.97 10.39 2.09x
tensorflow 2.5.0 ssd_resnet50_v1 37.91% 38.00% -0.24% 140.46 411.03 2.93x
tensorflow 2.5.0 ssd_mobilenet_v1 23.02% 23.13% -0.48% 12.25 26.90 2.20x
tensorflow 2.5.0 ssd_resnet34 21.97% 22.16% -0.86% 264.26 960.48 3.63x

Realtime latency for both precisions measured on CLX8280, 1s 4c per instance.
Framework  Version  Model  INT8 Tuning Accuracy  FP32 Accuracy Baseline  Acc Ratio [(INT8-FP32)/FP32]  INT8 Realtime (ms)  FP32 Realtime (ms)  Realtime Latency Ratio [FP32/INT8]
pytorch 1.9.0+cpu resnet18 69.58% 69.76% -0.26% 14.21 26.55 1.87x
pytorch 1.9.0+cpu resnet50 75.87% 76.13% -0.34% 24.89 53.84 2.16x
pytorch 1.9.0+cpu resnext101_32x8d 79.09% 79.31% -0.28% 64.03 147.51 2.30x
pytorch 1.9.0+cpu bert_base_mrpc 88.16% 88.73% -0.64% 41.15 81.56 1.98x
pytorch 1.9.0+cpu bert_base_cola 58.29% 58.84% -0.93% 39.17 83.42 2.13x
pytorch 1.9.0+cpu bert_base_sts-b 88.65% 89.27% -0.70% 39.59 83.07 2.10x
pytorch 1.9.0+cpu bert_base_sst-2 91.63% 91.86% -0.25% 39.39 83.17 2.11x
pytorch 1.9.0+cpu bert_base_rte 69.31% 69.68% -0.52% 39.51 81.84 2.07x
pytorch 1.9.0+cpu bert_large_mrpc 87.48% 88.33% -0.95% 112.80 281.91 2.50x
pytorch 1.9.0+cpu bert_large_squad 92.78988 93.04683 -0.28% 503.92 934.01 1.85x
pytorch 1.9.0+cpu bert_large_qnli 91.12% 91.82% -0.76% 111.08 289.13 2.60x
pytorch 1.9.0+cpu bert_large_rte 72.92% 72.56% 0.50% 151.93 298.53 1.96x
pytorch 1.9.0+cpu bert_large_cola 62.85% 62.57% 0.45% 113.04 285.43 2.52x
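The two ratio columns follow directly from the formulas in the table headers. A short worked check, using values copied from one TensorFlow row and one PyTorch row, reproduces the reported figures after rounding to two decimals; the function names are illustrative only.

```python
def acc_ratio(int8_acc: float, fp32_acc: float) -> float:
    """Relative accuracy change in percent: (INT8 - FP32) / FP32 * 100."""
    return (int8_acc - fp32_acc) / fp32_acc * 100


def latency_ratio(int8_ms: float, fp32_ms: float) -> float:
    """Speedup: FP32 latency divided by INT8 latency."""
    return fp32_ms / int8_ms


rows = {
    # model: (INT8 acc %, FP32 acc %, INT8 ms, FP32 ms), copied from the tables
    "tensorflow resnet50v1.0": (74.24, 74.27, 7.56, 21.24),
    "pytorch resnet50": (75.87, 76.13, 24.89, 53.84),
}
for name, (i8, f32, i8_ms, f32_ms) in rows.items():
    print(f"{name}: {acc_ratio(i8, f32):.2f}% {latency_ratio(i8_ms, f32_ms):.2f}x")
```

This prints -0.04% / 2.81x for the TensorFlow row and -0.34% / 2.16x for the PyTorch row, matching the tables.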

Validated Pruning Models

Tasks  FWK  Model  FP32 Baseline  Gradient Sensitivity with 20% Sparsity: Accuracy%  Drop%  Perf Gain (samples/s)  + ONNX Dynamic Quantization on Pruned Model: Accuracy%  Drop%  Perf Gain (samples/s)
SST-2 pytorch bert-base accuracy = 92.32 accuracy = 91.97 -0.38 1.30x accuracy = 92.20 -0.13 1.86x
QQP pytorch bert-base [accuracy, f1] = [91.10, 88.05] [accuracy, f1] = [89.97, 86.54] [-1.24, -1.71] 1.32x [accuracy, f1] = [89.75, 86.60] [-1.48, -1.65] 1.81x

Tasks  FWK  Model  FP32 Baseline  Pattern Lock on 70% Unstructured Sparsity: Accuracy%  Drop%  Pattern Lock on 50% 1:2 Structured Sparsity: Accuracy%  Drop%
MNLI pytorch bert-base [m, mm] = [84.57, 84.79] [m, mm] = [82.45, 83.27] [-2.51, -1.80] [m, mm] = [83.20, 84.11] [-1.62, -0.80]
SST-2 pytorch bert-base accuracy = 92.32 accuracy = 91.51 -0.88 accuracy = 92.20 -0.13
QQP pytorch bert-base [accuracy, f1] = [91.10, 88.05] [accuracy, f1] = [90.48, 87.06] [-0.68, -1.12] [accuracy, f1] = [90.92, 87.78] [-0.20, -0.31]
QNLI pytorch bert-base accuracy = 91.54 accuracy = 90.39 -1.26 accuracy = 90.87 -0.73
QnA pytorch bert-base [em, f1] = [79.34, 87.10] [em, f1] = [77.27, 85.75] [-2.61, -1.54] [em, f1] = [78.03, 86.50] [-1.65, -0.69]

Framework  Model  FP32 Baseline  Compression  Dataset  Acc (Drop)%
PyTorch resnet18 69.76 30% sparsity on magnitude ImageNet 69.47(-0.42)
PyTorch resnet18 69.76 30% sparsity on gradient sensitivity ImageNet 68.85(-1.30)
PyTorch resnet50 76.13 30% sparsity on magnitude ImageNet 76.11(-0.03)
PyTorch resnet50 76.13 30% sparsity on magnitude and post training quantization ImageNet 76.01(-0.16)
PyTorch resnet50 76.13 30% sparsity on magnitude and quantization aware training ImageNet 75.90(-0.30)
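The "30% sparsity on magnitude" rows refer to magnitude-based pruning. The minimal sketch below shows the general technique under simplifying assumptions: unstructured, one-shot pruning of a flat weight list. The library's own pruners may differ, for example by pruning gradually during training and operating on full tensors. It zeroes out the given fraction of weights with the smallest absolute values; `magnitude_prune` is a hypothetical name.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(round(len(weights) * sparsity))
    # Indices sorted by magnitude; ties keep their original order (stable sort).
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


w = [0.8, -0.05, 0.3, -0.6, 0.01, 0.45, -0.2, 0.07, -0.9, 0.15]
sparse = magnitude_prune(w, 0.30)
print(sparse)
print(sum(x == 0.0 for x in sparse) / len(sparse))  # achieved sparsity
```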

Validated Knowledge Distillation Examples

Example Name      Dataset   Student (Accuracy)  Teacher (Accuracy)     Student With Distillation (Accuracy Improvement)
ResNet example    ImageNet  ResNet18 (0.6739)   ResNet50 (0.7399)      0.6845 (0.0106)
BlendCnn example  MRPC      BlendCnn (0.7034)   BERT-Base (0.8382)     0.7034 (0)
BiLSTM example    SST-2     BiLSTM (0.7913)     RoBERTa-Base (0.9404)  0.8085 (0.0172)
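The distillation results above come from training the student against the teacher's outputs. A common formulation, sketched here in plain Python as an illustration rather than Neural Compressor's loss implementation, blends the usual cross-entropy on the true label with a temperature-softened KL divergence between teacher and student predictions. The T² factor is the usual correction that keeps the soft-target term's gradient scale comparable across temperatures. The temperature of 2.0 and weight alpha of 0.5 are arbitrary defaults chosen for this sketch.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 2.0, alpha: float = 0.5) -> float:
    """(1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher_T || student_T)."""
    hard_ce = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return (1 - alpha) * hard_ce + alpha * temperature ** 2 * kl


student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
print(distillation_loss(student, teacher, label=0))
```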

Validated Engine Examples

Realtime latency for both precisions measured on CLX8280, 1s 4c per instance.
Model  INT8 Tuning Accuracy  FP32 Accuracy Baseline  Acc Ratio [(INT8-FP32)/FP32]  INT8 Realtime (ms)  FP32 Realtime (ms)  Realtime Latency Ratio [FP32/INT8]
bert_base_mrpc 0.8235 83.09% -0.89% 21.91 71.53 3.26x
bert_large 90.6648 90.87 -0.23% 232.38 954.96 4.11x
distilbert_base_uncased_mrpc 0.8407 84.31% -0.28% 10.41 36.42 3.50x


Contributors

guomingz, ftian1, penghuicheng, clarkchin08, mengniwang95, pengxin99, chensuyue, chuanqi129, airmeng, tybulewicz, xin3he, yuwenzho, zhaoruic-intel, bmyrcha, zehao-intel, gongzheng0, kblamow, 872520333, deb-intel, daisyden, aradys, guyiyun, changwangss, dliang0406, hshen14, rafkrauz, zhiwei35, chendali-intel, dbkinder, ashahba
