Batch Norm

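Batch Norm normalizes over the channel dimension and produces C pairs of statistics (mean μ, standard deviation σ). Given an input feature of shape [N, H, W, C], the mean and variance are computed over [N, H, W] for each channel in C and used to normalize that channel.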

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

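# Stack 16 HWC images into a single [b, h, w, c] array.
image = [np.random.randn(30, 40, 3) for _ in range(16)]
image = rearrange(image, 'b h w c -> b h w c')

# Reference: per-channel statistics over (b, h, w).
image_ = rearrange(image, 'b h w c -> (b h w) c')
mean = rearrange(image_.mean(axis=0), 'c -> 1 1 1 c')
std = rearrange(image_.std(axis=0), 'c -> 1 1 1 c')  # NumPy std is biased (ddof=0), as in BatchNorm

y_ = (image - mean) / std

b, h, w, c = image.shape
bn = nn.BatchNorm2d(c, eps=1e-10, affine=False, track_running_stats=False)
# nn.BatchNorm2d expects [N, C, H, W], so permute before and after the call.
x = rearrange(torch.from_numpy(image), 'b h w c -> b c h w')
y = rearrange(bn(x), 'b c h w -> b h w c')

print('diff={}\n'.format(torch.abs(y - torch.from_numpy(y_)).max()))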

Layer Norm

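Layer Norm computes its statistics per sample, so it produces N pairs of μ, σ. Given an input feature of shape [N, H, W, C], the mean and variance are computed over [H, W, C] for each sample in N and used to normalize that sample.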

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

layer_norm = nn.LayerNorm([c, h, w], eps=1e-12, elementwise_affine=False)
y = layer_norm(x)

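# Reference: per-sample statistics over (c, h, w). LayerNorm uses the biased variance.
x_ = rearrange(x, 'b c h w -> (h w c) b')
mean = rearrange(x_.mean(dim=0), 'b -> b 1 1 1')
std = rearrange(x_.std(dim=0, unbiased=False), 'b -> b 1 1 1')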

y_ =  (x - mean)/std

print('diff={}\n'.format(torch.abs(y - y_).max()))

Instance Norm

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

instance_norm = nn.InstanceNorm2d(c, eps=1e-12, affine=False, track_running_stats=False)
y = instance_norm(x)

# Reference: per-sample, per-channel statistics over (h, w). InstanceNorm uses the biased variance.
x_ = rearrange(x, 'b c h w -> b c (h w)')
mean = rearrange(x_.mean(dim=2), 'b c -> b c 1 1')
std = rearrange(x_.std(dim=2, unbiased=False), 'b c -> b c 1 1')

y_ =  (x - mean)/std

print('diff={}\n'.format(torch.abs(y - y_).max()))

Group Norm

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 6, 20, 20))
b, c, h, w = x.shape
group_num = 3
n = 2

group_norm = nn.GroupNorm(group_num, c, eps=1e-12, affine=False)
y = group_norm(x)

x_ = rearrange(x, 'b (g n) h w -> b g (n h w)', g = group_num) # [6, 3, 2*20*20]
mean = rearrange(x_.mean(dim=2), 'b g -> b g 1')  # [6, 3, 1]
std = rearrange(x_.std(dim=2, unbiased=False), 'b g -> b g 1')  # biased std, as in GroupNorm

y_ =  (x_ - mean)/std
y_ = rearrange(y_, 'b g (n h w) -> b (g n) h w', g = group_num, h = h, w = w)

print('diff={}\n'.format(torch.abs(y - y_).max()))

Performance Optimization

Performance Measurement


  • theoretical peak
    Two Intel Xeon E5-2697 v2 CPUs (2S-E5) with 12 cores per CPU, each running at 2.7 GHz without turbo mode. These processors support the AVX extension with 256-bit SIMD instructions that can process 8 single-precision (32-bit) numbers per CPU cycle.
    The theoretical peak Flop/s is 2.7 (GHz) × 8 (SP FP) × 2 (ADD/MUL) × 12 (cores) × 2 (CPUs) = 1036.8 GFlop/s.

  • memory bandwidth
    The theoretical memory bandwidth is computed from the memory frequency (1866 MHz), the number of channels (4), and the number of bytes transferred per channel per cycle (8), which gives 1866 × 4 × 8 × 2 (# of processors) = 119 GByte/s peak bandwidth for the dual-socket 2S-E5 system.
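The two peak figures above are plain products, so they are easy to recompute when the hardware changes. The following is a minimal sketch; the function names `peak_gflops` and `peak_bandwidth_gbs` are illustrative and not taken from any library.

```python
def peak_gflops(clock_ghz, simd_lanes, flops_per_lane, cores, sockets):
    """Theoretical peak in GFlop/s: clock x SIMD lanes x ops per lane x cores x sockets."""
    return clock_ghz * simd_lanes * flops_per_lane * cores * sockets


def peak_bandwidth_gbs(mem_mhz, channels, bytes_per_transfer, sockets):
    """Theoretical peak memory bandwidth in GByte/s (1 GByte = 1e9 bytes)."""
    bytes_per_second = mem_mhz * 1e6 * channels * bytes_per_transfer * sockets
    return bytes_per_second / 1e9


# Numbers for the dual-socket E5-2697 v2 system described above.
flops = peak_gflops(clock_ghz=2.7, simd_lanes=8, flops_per_lane=2, cores=12, sockets=2)
bandwidth = peak_bandwidth_gbs(mem_mhz=1866, channels=4, bytes_per_transfer=8, sockets=2)

print(f"peak compute  : {flops:.1f} GFlop/s")
print(f"peak bandwidth: {bandwidth:.1f} GByte/s")
```

With the values from the notes this gives roughly 1036.8 GFlop/s and 119.4 GByte/s, matching the hand calculation.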



Metal for Paddle Lite


  • Metal kernel and context
  • Metal OP execution


C++ 11









  • noexcept specifier
  • noexcept operator




[ capture ] ( params ) opt -> ret { body; };


  • [] captures nothing
  • [=] captures by value
  • [&] captures by reference
  • [this] captures the current this pointer by value




docker build -f Dockerfile.manylinux2014_cuda11_4_tensorrt8_2 --network=host --build-arg POLICY=manylinux2014 --build-arg PLATFORM=x86_64 --build-arg DEVTOOLSET_ROOTPATH=/opt/rh/devtoolset-10/root --build-arg PREPEND_PATH=/opt/rh/devtoolset-10/root/usr/bin: --build-arg LD_LIBRARY_PATH_ARG=/opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib:/opt/rh/devtoolset-10/root/usr/lib64/dyninst:/opt/rh/devtoolset-10/root/usr/lib/dyninst:/usr/local/lib64 --tag=onnxruntime:cuda11.4_trt8.2 .

Paddle Lite code reading


lite::Optimizer optimizes a program: it uses the MIR passes to analyze the program and exports an optimized program.

std::unique_ptr<RuntimeProgram> RunDefaultOptimizer(
    Program&& program,
    const std::vector<Place>& valid_places,
    core::KernelPickFactor kernel_pick_factor,
    const std::vector<std::string>& passes) {
  Optimizer optim(valid_places, kernel_pick_factor);
  // ...
  for (auto& pass_name : passes_local) {
    optim.AddPass(pass_name);
  }

  return optim.Run(std::move(program));
}

class Optimizer {
 public:
  Optimizer(const std::vector<Place>& valid_places,
            core::KernelPickFactor kernel_pick_factor)
      : valid_places_(valid_places), kernel_pick_factor_(kernel_pick_factor) {
    CHECK(!valid_places.empty()) << "At least one valid_place should be set";
  }

  // Append a pass to the optimizer.
  void AddPass(const std::string& pass_name);
  // Optimize a program to generate a runtime program.
  std::unique_ptr<RuntimeProgram> Run(Program&& program);

 protected:
  // Run all the added passes.
  void ApplyPasses(std::vector<std::unique_ptr<mir::SSAGraph>>* graphes);

  // Generate the optimized runtime program.
  std::unique_ptr<RuntimeProgram> GenRuntimeProgram(
      std::vector<std::unique_ptr<mir::SSAGraph>>* graphs);

  void InitTargetTypeTransformPass();
  void InitControlFlowOpUnusedInputsAndOutputsEliminatePass();
  void InitControlFlowOpSharedInputsAndOutputsPlaceSyncPass();
  void SpecifyKernelPickTactic(core::KernelPickFactor factor);
  Scope* exec_scope() { return exec_scope_; }

 private:
  std::vector<Place> valid_places_;
  Scope* exec_scope_{};
  std::vector<mir::Pass*> passes_;
  std::vector<std::unique_ptr<mir::SSAGraph>> graphs_;
  core::KernelPickFactor kernel_pick_factor_;
};


import onnx

file_path = './onnx_model/rec_large.onnx'
model = onnx.load(file_path)
# Mark dimensions 0, 2, and 3 of the first input as dynamic.
model.graph.input[0].type.tensor_type.shape.dim[0].dim_param = '?'
model.graph.input[0].type.tensor_type.shape.dim[2].dim_param = '?'
model.graph.input[0].type.tensor_type.shape.dim[3].dim_param = '?'
onnx.save(model, './onnx_model/rec_large_dynamic.onnx')





Algorithm: recall quality, memory, incremental updates
Implementation: performance, memory, ease of use





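PaddleSlim provides three main quantization methods: quantization-aware training (Quant Aware Training, QAT), dynamic post-training quantization (Post Training Quantization Dynamic, PTQ Dynamic), and static post-training quantization (Post Training Quantization Static, PTQ Static).

  • Quantization-aware training: lets the model perceive the effect of quantized arithmetic on accuracy and reduces the quantization error through fine-tuning.
  • Dynamic post-training quantization: only maps the weights of specific operators in the model from FP32 to INT8/16.
  • Static post-training quantization: uses a small amount of unlabeled calibration data and methods such as KL divergence to compute the quantization scale factors.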




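Quantization method | API | Function | Typical use case
Online quantization (QAT) | dynamic graph: paddleslim.QAT; static graph: paddleslim.quant.quant_aware | Minimizes the model's quantization error through fine-tuning | Scenarios and models that are sensitive to quantization, e.g. object detection, segmentation, OCR
Static post-training quantization (PTQ Static) | paddleslim.quant.quant_post_static | Produces a quantized model from a small amount of calibration data | Scenarios that are insensitive to quantization, e.g. image classification
Dynamic post-training quantization (PTQ Dynamic) | paddleslim.quant.quant_post_dynamic | Quantizes only the learnable weights of the model | Large models dominated by memory-access cost, e.g. BERT
Embedding quantization (Quant Embedding) | paddleslim.quant.quant_embedding | Quantizes only the Embedding parameters of the model | Any model that contains an Embedding layer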

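Static post-training quantization (Post Training Quantization Static, PTQ Static)

Static post-training quantization has two ways to compute the quantization factor: a non-saturating method and a saturating method. The non-saturating method takes the maximum absolute value of the whole Tensor, abs_max, and maps it to 127. The saturating method uses KL divergence to find a suitable threshold T (0 < T < abs_max) and maps it to 127. In general, the weights of a quantized Op use the non-saturating method, while its activations (inputs and outputs) use the saturating method.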
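The difference between the two mappings can be illustrated with a small sketch. This is not PaddleSlim's implementation: the function names are hypothetical, and the threshold T is passed in by hand instead of being found by a KL-divergence search.

```python
def quantize_abs_max(values, num_bits=8):
    """Non-saturating: map max |x| to qmax (127 for INT8), nothing is clipped."""
    qmax = 2 ** (num_bits - 1) - 1
    abs_max = max(abs(v) for v in values)
    if abs_max == 0:
        return [0 for _ in values], 1.0
    scale = abs_max / qmax
    return [round(v / scale) for v in values], scale


def quantize_saturating(values, threshold, num_bits=8):
    """Saturating: map a threshold T < abs_max to qmax and clip values outside [-T, T].

    Assumption: T is given; a real calibrator would search for it (e.g. with KL divergence).
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = threshold / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in quantized]


# One outlier (8.0) dominates abs_max and squeezes the small values together.
activations = [0.02, -0.6, 0.25, 8.0]
q_ns, s_ns = quantize_abs_max(activations)
q_sat, s_sat = quantize_saturating(activations, threshold=1.0)
print("non-saturating:", q_ns, dequantize(q_ns, s_ns))
print("saturating    :", q_sat, dequantize(q_sat, s_sat))
```

In this toy input the saturating mapping keeps more resolution for the small values at the cost of clipping the outlier, which is the usual motivation for applying it to activations.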


GitHub SSH configuration


ssh-keygen -t rsa -f ~/.ssh/baidu_id_rsa


PreferredAuthentications publickey
IdentityFile ~/.ssh/id_rsa

The ~/.ssh/config file should have permission 644 or stricter (e.g. 600); it must not be writable by group or others.

Common git commands

git clone --depth 1 --branch v5.0.8 --no-checkout



cmake -G Ninja ../llvm \
 -DLLVM_ENABLE_PROJECTS="mlir;clang" \



/// Policy object describing MmaTensorOp
template <
    /// Warp-level GEMM operator (concept: gemm::warp::Mma)
    typename Operator_,
    /// Padding used for A operand in shared memory (concept: MatrixShape)
    typename SmemPaddingA_,
    /// Padding used for B operand in shared memory (concept: MatrixShape)
    typename SmemPaddingB_,
    /// Number of partitions of K dimension of GEMM
    int PartitionsK = 1>
struct MmaPolicy {
  /// Warp-level GEMM operator (concept: gemm::warp::MmaTensorOp or gemm::warp::MmaSimt)
  using Operator = Operator_;

  /// Padding used for A operand in shared memory
  using SmemPaddingA = SmemPaddingA_;

  /// Padding used for B operand in shared memory
  using SmemPaddingB = SmemPaddingB_;

  /// Number of partitions of K dimension
  static int const kPartitionsK = PartitionsK;
};

/// Structure to compute the matrix product targeting CUDA cores and SIMT math
/// instructions.
template <
    /// Size of the Gemm problem - concept: gemm::GemmShape<>
    typename Shape_,
    /// Policy describing tuning details (concept: MmaPolicy)
    typename Policy_,
    /// Number of stages,
    int Stages,
    /// Used for partial specialization
    typename Enable = bool>
class MmaBase {
 public:
  ///< Size of the Gemm problem - concept: gemm::GemmShape<>
  using Shape = Shape_;

  ///< Policy describing tuning details
  using Policy = Policy_;

  // Dependent types

  /// Warp-level Mma
  using Operator = typename Policy::Operator;

  /// Shape describing the overall GEMM computed from shared memory
  /// by each warp.
  using WarpGemm = typename Policy::Operator::Shape;

  /// Shape describing the number of warps filling the CTA
  using WarpCount = GemmShape<Shape::kM / WarpGemm::kM,
                              Shape::kN / WarpGemm::kN,
                              Shape::kK / WarpGemm::kK>;

  /// Number of warp-level GEMM operations
  static int const kWarpGemmIterations =
      (WarpGemm::kK / Operator::Policy::MmaShape::kK);

  /// Number of stages
  static int const kStages = Stages;

  /// Tensor reference to the A operand
  using TensorRefA = TensorRef<typename Operator::ElementA, typename Operator::LayoutA>;

  /// Tensor reference to the B operand
  using TensorRefB = TensorRef<typename Operator::ElementB, typename Operator::LayoutB>;

  static_assert(kWarpGemmIterations > 1,
                "The pipelined structure requires at least two warp-level "
                "GEMM operations.");

  static_assert((kWarpGemmIterations % 2) == 0,
                "Inner loop iteration must be an even number.");

  // Nested structs

  /// Shared storage object needed by threadblock-scoped GEMM
  class SharedStorage {
   public:
    // Type definitions

    /// Shape of the A matrix operand in shared memory
    using ShapeA = MatrixShape<Shape::kM + Policy::SmemPaddingA::kRow,
                               Shape::kK * kStages +
                                   Policy::SmemPaddingA::kColumn>;

    /// Shape of the B matrix operand in shared memory
    using ShapeB =
        MatrixShape<Shape::kK * kStages + Policy::SmemPaddingB::kRow,
                    Shape::kN + Policy::SmemPaddingB::kColumn>;

    // Data members

    /// Buffer for A operand
    AlignedBuffer<typename Operator::ElementA, ShapeA::kCount> operand_A;

    /// Buffer for B operand
    AlignedBuffer<typename Operator::ElementB, ShapeB::kCount> operand_B;


    // Methods

    /// Returns a layout object for the A matrix
    static typename Operator::LayoutA LayoutA() {
      return Operator::LayoutA::packed({ShapeA::kRow, ShapeA::kColumn});
    }

    /// Returns a layout object for the B matrix
    static typename Operator::LayoutB LayoutB() {
      return Operator::LayoutB::packed({ShapeB::kRow, ShapeB::kColumn});
    }

    /// Returns a TensorRef to the A operand
    TensorRefA operand_A_ref() {
      return TensorRefA{operand_A.data(), LayoutA()};
    }

    /// Returns a TensorRef to the B operand
    TensorRefB operand_B_ref() {
      return TensorRefB{operand_B.data(), LayoutB()};
    }
  };

 protected:
  // Data members

  /// Iterator to load a warp-scoped tile of A operand from shared memory
  typename Operator::IteratorA warp_tile_iterator_A_;

  /// Iterator to load a warp-scoped tile of B operand from shared memory
  typename Operator::IteratorB warp_tile_iterator_B_;


 public:
  /// Construct from tensor references
  MmaBase(
      ///< Shared storage needed for internal use by threadblock-scoped GEMM
      SharedStorage &shared_storage,
      ///< ID within the threadblock
      int thread_idx,
      ///< ID of warp
      int warp_idx,
      ///< ID of each thread within a warp
      int lane_idx)
      : warp_tile_iterator_A_(shared_storage.operand_A_ref(), lane_idx),
        warp_tile_iterator_B_(shared_storage.operand_B_ref(), lane_idx) {}
};




Brand Name | V100 PCIe | V100 SXM2 | T4
GPU Architecture | NVIDIA Volta | NVIDIA Volta | NVIDIA Turing
Tensor Cores | 640 (1st gen) | 640 (1st gen) | 320 (2nd gen)
NVIDIA CUDA® Cores | 5,120 | 5,120 | 2,560
TensorFLOPS | 112 TFLOPS | 125 TFLOPS | -
Single-Precision | 14 TFLOPS | 15.7 TFLOPS | 8.1 TFLOPS
Double-Precision | 7 TFLOPS | 7.8 TFLOPS | -
Mixed-Precision (FP16/FP32) | 12x TFLOPS | - | 65 TFLOPS
INT8 | - | - | 130 TOPS
INT4 | - | - | 260 TOPS
GPU Memory | 32 GB HBM2, 900 GB/sec | 32 GB HBM2, 900 GB/sec | 16 GB GDDR6, 300 GB/sec
Interconnect Bandwidth | 32 GB/sec | 300 GB/sec | 32 GB/sec
System Interface | x16 PCIe Gen3 | x6 NVLink 2.0 | x16 PCIe Gen3

GPU Infos

  • P100



  • V100
    • ALU
      • 5376 FP32 cores = 6 GPC * 7 TPC * 2 SM * 64 FP32 cores (64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, four texture units per SM). This is the full GV100 chip; the V100 product enables 80 SMs, i.e. 5120 FP32 cores.
      • SM
        • 64 FP32 + INT32 cores, 32 FP64 cores, 8 tensor cores(FP32/FP16 mixed-precision)
        • 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 2 tensor cores, 8 LD/ST units
      • TensorCore: 64 FP16 FMA per Tensor Core per clock, 512 per SM per clock. 64 FMA × 640 Tensor Cores × 2 flop/FMA × 1530 MHz ≈ 125 TFlop/s
      • single precision (FP32) floating-point calculations: (5120 FP32 CUDA cores) × (2 flop/core/cycle) × (1.53 Gcycle/s) ≈ 15.7 TFlop/s. The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions (instruction throughput is N/32 instructions per clock cycle).
    • Mem
      • 512 bit * 8 memory controllers
      • 6144 KB L2 cache
    • IO
      • six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s.
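The V100 peak numbers in the list above can be reproduced with a few multiplications. This is a minimal sketch; `peak_tflops` is an illustrative helper name, and the inputs are the counts and the 1.53 GHz boost clock quoted in these notes.

```python
def peak_tflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Peak throughput in TFlop/s for `units` execution units at `clock_ghz`."""
    return units * flops_per_unit_per_cycle * clock_ghz / 1e3


# Tensor Cores: 640 units, 64 FMA per clock each, 2 flop per FMA.
tensor = peak_tflops(units=640, flops_per_unit_per_cycle=64 * 2, clock_ghz=1.53)

# FP32 CUDA cores: 5120 units, 1 FMA per clock each, 2 flop per FMA.
fp32 = peak_tflops(units=5120, flops_per_unit_per_cycle=2, clock_ghz=1.53)

# NVLink: six links at 50 GB/s bi-directional each.
nvlink_gbs = 6 * 50

print(f"V100 tensor peak: {tensor:.1f} TFlop/s")
print(f"V100 FP32 peak  : {fp32:.1f} TFlop/s")
print(f"V100 NVLink     : {nvlink_gbs} GB/s")
```

The results are about 125.3 and 15.7 TFlop/s, in line with the 125 and 15.7 figures above.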





  • T4


  • A100
    • ALU
      • 8192 FP32 cores = 8 GPC * 8 TPC * 2 SM * 64 FP32 cores (64 INT32 cores, 32 FP64 cores, 4 Tensor Cores, four texture units per SM). This is the full GA100 chip; the A100 product enables 108 SMs, i.e. 6912 FP32 cores.
      • SM
        • 64 FP32 + INT32 cores, 32 FP64 cores, 4 * 3rd tensor cores(FP32/FP16, int8/int4 mixed-precision)
        • 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 1 tensor cores, 8 LD/ST units
      • TensorCore: 256 FP16 FMA per Tensor Core per clock. 256 FMA × 4 Tensor Cores × 108 SMs × 2 flop/FMA × 1.41 Gcycle/s ≈ 312 TFlop/s. Sparse performance is double.
      • single precision (FP32) floating-point calculations on the full GA100: (8192 FP32 CUDA cores) × (2 flop/core/cycle) × (1.41 Gcycle/s) ≈ 23.1 TFlop/s. The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions (instruction throughput is N/32 instructions per clock cycle).
      • A100 product: 108 SMs (7 GPCs, 7 or 8 TPCs per GPC, 2 SMs per TPC) × 64 FP32 cores × 1410 MHz × 2 ≈ 19.5 TFlop/s
    • Mem/IO
      • 512 bit * 12 memory controllers (full GA100), 512 bit * 10 memory controllers (A100)
      • 5 active HBM2 stacks at 1215 MHz DDR (10 512-bit memory controllers, not the full GA100): 10 × 512 bit × 1215 MHz × 2 / 8 ≈ 1555 GB/sec
      • 192 KB shared-mem / L1 per SM
      • 40 MB L2 cache
      • NVLink: 50 GB/sec * 12 links = 600 GB/sec
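The A100 figures can be checked the same way, this time starting from per-SM counts so that the 108-SM product and the full 128-SM GA100 are easy to compare. This is a sketch with illustrative function names; all inputs come from the list above.

```python
def tensor_tflops(sms, tensor_cores_per_sm, fma_per_core_per_cycle, clock_ghz):
    """Dense FP16 Tensor Core peak in TFlop/s (2 flop per FMA)."""
    return sms * tensor_cores_per_sm * fma_per_core_per_cycle * 2 * clock_ghz / 1e3


def fp32_tflops(sms, fp32_cores_per_sm, clock_ghz):
    """FP32 CUDA core peak in TFlop/s (one FMA = 2 flop per core per cycle)."""
    return sms * fp32_cores_per_sm * 2 * clock_ghz / 1e3


def hbm2_bandwidth_gbs(controllers, bits_per_controller, mem_mhz):
    """HBM2 peak bandwidth in GB/s: bus width x clock x 2 (DDR) / 8 bits per byte."""
    return controllers * bits_per_controller * mem_mhz * 1e6 * 2 / 8 / 1e9


print(f"A100 tensor peak      : {tensor_tflops(108, 4, 256, 1.41):.1f} TFlop/s")
print(f"A100 FP32 (108 SMs)   : {fp32_tflops(108, 64, 1.41):.1f} TFlop/s")
print(f"GA100 FP32 (128 SMs)  : {fp32_tflops(128, 64, 1.41):.1f} TFlop/s")
print(f"A100 HBM2 bandwidth   : {hbm2_bandwidth_gbs(10, 512, 1215):.1f} GB/s")
```

The 19.5 versus 23.1 TFlop/s difference in the notes is therefore just the 108 enabled SMs versus the full 128.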



Comparison of NVIDIA Tesla GPUs


Data Center GPU | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA A100
GPU Codename | GP100 | GV100 | GA100
GPU Architecture | NVIDIA Pascal | NVIDIA Volta | NVIDIA Ampere
GPU Board Form Factor | SXM | SXM2 | SXM4
SMs | 56 | 80 | 108
TPCs | 28 | 40 | 54
FP32 Cores / SM | 64 | 64 | 64
FP32 Cores / GPU | 3584 | 5120 | 6912
FP64 Cores / SM | 32 | 32 | 32
FP64 Cores / GPU | 1792 | 2560 | 3456
INT32 Cores / SM | NA | 64 | 64
INT32 Cores / GPU | NA | 5120 | 6912
Tensor Cores / SM | NA | 8 | 4
Tensor Cores / GPU | NA | 640 | 432
GPU Boost Clock | 1480 MHz | 1530 MHz | 1410 MHz
Peak FP16 Tensor TFLOPS with FP16 Accumulate | NA | 125 | 312 / 624
Peak FP16 Tensor TFLOPS with FP32 Accumulate | NA | 125 | 312 / 624
Peak BF16 Tensor TFLOPS with FP32 Accumulate | NA | NA | 312 / 624
Peak TF32 Tensor TFLOPS | NA | NA | 156 / 312
Peak FP64 Tensor TFLOPS | NA | NA | 19.5
Peak INT8 Tensor TOPS | NA | NA | 624 / 1248
Peak INT4 Tensor TOPS | NA | NA | 1248 / 2496
Peak FP16 TFLOPS | 21.2 | 31.4 | 78
Peak BF16 TFLOPS | NA | NA | 39
Peak FP32 TFLOPS | 10.6 | 15.7 | 19.5
Peak FP64 TFLOPS | 5.3 | 7.8 | 9.7
Peak INT32 TOPS | NA | 15.7 | 19.5
Texture Units | 224 | 320 | 432
Memory Interface | 4096-bit HBM2 | 4096-bit HBM2 | 5120-bit HBM2
Memory Size | 16 GB | 32 GB / 16 GB | 40 GB
Memory Data Rate | 703 MHz DDR | 877.5 MHz DDR | 1215 MHz DDR
Memory Bandwidth | 720 GB/sec | 900 GB/sec | 1555 GB/sec
L2 Cache Size | 4096 KB | 6144 KB | 40960 KB
Shared Memory Size / SM | 64 KB | Configurable up to 96 KB | Configurable up to 164 KB
Register File Size / SM | 256 KB | 256 KB | 256 KB
Register File Size / GPU | 14336 KB | 20480 KB | 27648 KB
TDP | 300 Watts | 300 Watts | 400 Watts
Transistors | 15.3 billion | 21.1 billion | 54.2 billion
GPU Die Size | 610 mm² | 815 mm² | 826 mm²
TSMC Manufacturing Process | 16 nm FinFET+ | 12 nm FFN | 7 nm N7

Peak rates are based on the GPU boost clock. For the A100 Tensor Core rows, a / b means dense / with sparsity.

Nsight Systems

Nsight Systems and Nsight Compute are both built on the CUDA Profiling Tools Interface (CUPTI).

nsys profile --stats=true ./main

CUDA_VISIBLE_DEVICES=3 nsys profile -t cuda,nvtx,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn --stats=true --cuda-memory-usage true python --model_file=../../../work/test/infer_bench/Models/MobileNetV1/inference.pdmodel --params_file=../../../work/test/infer_bench/Models/MobileNetV1/inference.pdiparams --use_gpu=1 --repeat=2


c++ format

"C_Cpp.clang_format_style": "{BasedOnStyle: Webkit, BreakBeforeBraces: Attach, IndentWidth: 4, BinPackParameters: false, NamespaceIndentation: None, BreakConstructorInitializers: AfterColon, ContinuationIndentWidth: 8, ConstructorInitializerIndentWidth: 8, ColumnLimit: 120, AlwaysBreakTemplateDeclarations: Yes, AllowShortFunctionsOnASingleLine: None}"

