zhangjun / zhangjun.github.io Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 0.0 641 KB

https://zhangjun.github.io

JavaScript 47.24% Stylus 52.52% Dockerfile 0.24%

zhangjun.github.io's Introduction

zhangjun.github.io

https://zhangjun.github.io

Introduction

content

zhangjun.github.io's People

Contributors

Stargazers

Watchers

zhangjun.github.io's Issues

NVDLA

https://github.com/OAID/Tengine/tree/tengine-lite/source/device/opendla

深度学习基础知识

归一化方法

Batch Norm

Batch Norm在通道维度进行归一化，最后得到C个统计量u，δ。假设输入特征为[N, H, W, C]，在C的每个维度上对[N, H, W]计算其均值、方差，用于该维度上的归一化操作。

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

image = [np.random.randn(30, 40, 3) for _ in range(16)]
image = rearrange(image, 'b h w c -> b h w c')
# print(rearrange(image, 'b h w c -> b h w c').shape)

image_ = rearrange(image, 'b h w c -> (b h w) c')
mean = rearrange(image_.mean(axis=0), 'c -> 1 1 1 c')
std = rearrange(image_.std(axis=0), 'c -> 1 1 1 c')

y_ =  (image - mean)/std

b, h, w, c = image.shape
bn = nn.BatchNorm2d(c, eps=1e-10, affine=False, track_running_stats=False)
y = bn(torch.from_numpy(image))

print('diff={}\n'.format(torch.abs(y - y_).max()))

Layer Norm

Layer Norm以样本为单位计算统计量，因此最后会得到N个u，δ。假设输入特征为[N, H, W, C]，在N的每个维度上对[H, W，C]计算其均值、方差，用于该维度上的归一化操作。

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

layer_norm = nn.LayerNorm([c, h, w], eps=1e-12, elementwise_affine=False)
y = layer_norm(x)

x_ = rearrange(x, 'b c h w -> (h w c) b')
mean = rearrange(x_.mean(axis=0), 'b -> b 1 1 1')
std = rearrange(x_.std(axis=0), 'b -> b 1 1 1')

y_ =  (x - mean)/std

print('diff={}\n'.format(torch.abs(y - y_).max()))

Instance Norm

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 3, 20, 20))
b, c, h, w = x.shape

instance_norm = nn.InstanceNorm2d(c, eps=1e-12, affine=False, track_running_stats=False)
y = instance_norm(x)

x_ = rearrange(x, 'b c h w -> b c (h w)')
# mean = rearrange(x_.mean(axis=2), 'b c -> b c 1 1')
# std = rearrange(x_.std(axis=2), 'b c -> b c 1 1')
mean = rearrange(x_.mean(dim=2), 'b c -> b c 1 1')
std = rearrange(x_.std(dim=2), 'b c -> b c 1 1')

y_ =  (x - mean)/std

print('diff={}\n'.format(torch.abs(y - y_).max()))

Group Norm

import numpy as np
import torch
import torch.nn as nn
from einops import rearrange, repeat, reduce

x = torch.randn((6, 6, 20, 20))
b, c, h, w = x.shape
group_num = 3
n = 2

group_norm = nn.GroupNorm(group_num, c, eps=1e-12, affine=False)
y = group_norm(x)

x_ = rearrange(x, 'b (g n) h w -> b g (n h w)', g = group_num) # [6, 3, 2*20*20]
print(x_.shape)
mean = rearrange(x_.mean(dim=2), 'b g -> b g 1')  # [6, 3, 1]
std = rearrange(x_.std(dim=2), 'b g -> b g 1')

y_ =  (x_ - mean)/std
y_ = rearrange(y_, 'b g (n h w) -> b (g n) h w', g = group_num, h = h, w = w)

print('diff={}\n'.format(torch.abs(y - y_).max()))

Performance Optimization

Performance Measurement

CPU

theoretical peak
two Intel Xeon E5-2697 v2 (2S-E5) with 12 cores per CPU, each running at 2.7 GHz without turbo mode. These processors support the AVX extension with 256-bit SIMD instructions that can process 8 single precision (32 bits) numbers per CPU cycle.
theoretical peak Flop/s is 2.7 (GHz) × 8 (SP FP) × 2 (ADD/MULL) × 12 (cores) × 2 (CPUs) = 1036.8 GFlop/s.
memory bandwidth
theoretical memory bandwidth is computed from the memory frequency (1866 GHz), the number of channels (4), the number of bytes transferred by channel per cycle (8), which gives 1866 × 4 × 8 × 2 (# of processors) = 119 GByte/s peak bandwidth for the dual socket 2S-E5 system.

GPU

Metal

Metal for Paddle Lite

Metal kernel and context
Metal OP executation

c++

C++ 11

常用C++11特性

1、移动语义

2、右值引用

3、智能指针

4、初始化列表

5、静态断言

static_assert()
编译期间的断言

6、noexcept

noexcept修饰符
函数声明后添加noexcept，表示修饰的函数不会抛出异常。如果抛出异常，编译器直接调用std::terminate()函数终止程序运行，比基于异常机制的throw()效率更高。因为异常机制会有额外开销，函数栈会被依次展开，调用在本帧中已构造的自动变量的析构函数。
noexcept操作符

7、push_back和emplace_back

push_back会触发构造函数和移动构造函数
emplace_back原地构造

8、lambda

[ capture ] ( params ) opt -> ret { body; };

捕获列表：

[] 不捕获任何变量
[=] 按值
[&] 按引用
[this] 值传递捕获当前this
捕获列表不允许变量重复传递
lambda函数是一个const函数，mutable可以取消常量属性。

todo

TensorRT

ONNXRuntime TRT

docker build -f Dockerfile.manylinux2014_cuda11_4_tensorrt8_2 --network=host --build-arg POLICY=manylinux2014 --build-arg PLATFORM=x86_64 --build-arg DEVTOOLSET_ROOTPATH=/opt/rh/devtoolset-10/root --build-arg PREPEND_PATH=/opt/rh/devtoolset-10/root/usr/bin: --build-arg LD_LIBRARY_PATH_ARG=/opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib:/opt/rh/devtoolset-10/root/usr/lib64/dyninst:/opt/rh/devtoolset-10/root/usr/lib/dyninst:/usr/local/lib64 --tag=onnxruntime:cuda11.4_trt8.2 .

MLIR

模型压缩

TO ADD

GEMM

Paddle Lite 代码阅读

optimizer

lite::Optimizer optimize a program. It utilize the mir passes to analysis the program and export an optimized program.

std::unique_ptr<RuntimeProgram> RunDefaultOptimizer(
    Program&& program,
    const std::vector<Place>& valid_places,
    core::KernelPickFactor kernel_pick_factor,
    const std::vector<std::string>& passes) {
  Optimizer optim(valid_places, kernel_pick_factor);
  // ...
  for (auto& pass_name : passes_local) {
    optim.AddPass(pass_name);
  }

  return optim.Run(std::move(program));
}

class Optimizer {
 public:
  Optimizer(const std::vector<Place>& valid_places,
            core::KernelPickFactor kernel_pick_factor)
      : valid_places_(valid_places), kernel_pick_factor_(kernel_pick_factor) {
    CHECK(!valid_places.empty()) << "At least one valid_place should be set";
  }

  // Append a pass to the optimizer.
  void AddPass(const std::string& pass_name);
  // Optimize a program to generate a runtime program.
  std::unique_ptr<RuntimeProgram> Run(Program&& program);

 protected:
  // Run all the added passes.
  void ApplyPasses(std::vector<std::unique_ptr<mir::SSAGraph>>* graphes);

  // Generate the optimized runtime program.
  std::unique_ptr<RuntimeProgram> GenRuntimeProgram(
      std::vector<std::unique_ptr<mir::SSAGraph>>* graphs);

  void InitTargetTypeTransformPass();
  void InitControlFlowOpUnusedInputsAndOutputsEliminatePass();
  void InitControlFlowOpSharedInputsAndOutputsPlaceSyncPass();
  void SpecifyKernelPickTactic(core::KernelPickFactor factor);
  Scope* exec_scope() { return exec_scope_; }

 private:
  std::vector<Place> valid_places_;
  Scope* exec_scope_{};
  std::vector<mir::Pass*> passes_;
  std::vector<std::unique_ptr<mir::SSAGraph>> graphs_;
  core::KernelPickFactor kernel_pick_factor_;
};

移动端框架链接

MNN optimizer
https://github.com/alibaba/MNN/tree/master/tools/converter/source/optimizer/merge
https://github.com/alibaba/MNN/tree/master/tools/converter/source/optimizer/postconvert

Paddle

IR

serving框架

https://github.com/Adlik/Adlik

onnx

file_path = './onnx_model/rec_large.onnx'
model = onnx.load(file_path)
model.graph.input[0].type.tensor_type.shape.dim[0].dim_param = '?'
model.graph.input[0].type.tensor_type.shape.dim[2].dim_param = '?'
model.graph.input[0].type.tensor_type.shape.dim[3].dim_param = '?'
onnx.save(model, './onnx_model/rec_large_dynamic.onnx')

ONNXRumtime

optimizer

https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/core/optimizer

elination

fusion

layer_norm
matmul_add
matmul_scale
matmul_transpose
gelu

向量检索

算法选型

性能、内存/磁盘、召回率、实时更新、增量更新、删除

算法	召回效果	内存	增量更新
HNSW	好	略高	是
PQ	好	低	否
ANNOY	较好	中	是

实现	性能	内存	易用性
FAISS	较好	低	好
Nmslib	好	高	差

LLM

repo链接

https://github.com/THUDM/ChatGLM-6B
https://github.com/mymusise/ChatGLM-Tuning
https://github.com/LianjiaTech/BELLE

LLM量化

https://zhuanlan.zhihu.com/p/616969812

SmoothQuant
Outlier Suppression
- Outlier Suppression
- Outlier Suppression+
AWQ
基于激活与参数大小的缩放保护
LLM.int8
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
- https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
- https://github.com/timdettmers/bitsandbytes
GPTQ
- GPTQ: 模型量化，穷鬼救星
  不使用贪心算法，对W进行优化时，固定位置挑选Q
  W不同列间权重更新互相独立，使用批处理更新（Lazy Batch更新，group_size参数控制）
  数值稳定性优化
- 4-bit LLM Quantization with GPTQ
ZeroQuant
- microsoft/DeepSpeed#2217
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
LUT-GEMM
- LUT-GEMM: 基于lut的量化矩阵乘法在大规模生成语言模型中的高效推理
SparseGPT
weight only

fine-tune

PEFT
QLora

LLM papers

Awesome-LLM-System-Papers
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
https://flexflow.ai/specInfer/
https://www.gabrieleoliaro.com/publication/expertflow/expertflow.pdf

量化

PaddleSlim量化

PaddleSlim主要包含三种量化方法：量化训练(Quant Aware Training, QAT)、动态离线量化(Post Training Quantization Dynamic, PTQ Dynamic)、静态离线量化(Post Training Quantization Static, PTQ Static)。

量化训练量化训练让模型感知量化运算对模型精度带来的影响，通过finetune训练降低量化误差。
动态离线量化动态离线量化仅将模型中特定算子的权重从FP32类型映射成INT8/16类型。
静态离线量化静态离线量化使用少量无标签校准数据，采用KL散度等方法计算量化比例因子。

综合对比了模型量化方法的使用条件、易用性、精度损失和预期收益。

量化方法	API接口	功能	经典适用场景
在线量化 (QAT)	动态图：paddleslim.QAT; 静态图：paddleslim.quant.quant_aware	通过finetune训练将模型量化误差降到最小	对量化敏感的场景、模型，例如目标检测、分割, OCR
静态离线量化 (PTQ Static)	paddleslim.quant.quant_post_static	通过少量校准数据得到量化模型	对量化不敏感的场景，例如图像分类任务
动态离线量化 (PTQ Dynamic)	paddleslim.quant.quant_post_dynamic	仅量化模型的可学习权重	模型体积大、访存开销大的模型，例如BERT模型
Embedding量化（Quant Embedding）	paddleslim.quant.quant_embedding	仅量化模型的Embedding参数	任何包含Embedding层的模型

静态离线量化（Post Training Quantization Static, PTQ Static）

静态离线量化中，有两种计算量化因子的方法，非饱和量化方法和饱和量化方法。非饱和量化方法计算整个Tensor的绝对值最大值abs_max，将其映射为127。饱和量化方法使用KL散度计算一个合适的阈值T (0<T<mab_max)，将其映射为127。一般而言，待量化Op的权重采用非饱和量化方法，待量化Op的激活（输入和输出）采用饱和量化方法。

设计模式

git

github ssh配置

生成key

ssh-keygen -t rsa -f ~/.ssh/baidu_id_rsa

配置～/.ssh/config文件

#GitHub
Host github.com
HostName github.com
PreferredAuthentications publickey
IdentityFile ~/.ssh/id_rsa

~/.ssh/config 文件权限必须为644

git 常用命令

git clone --depth 1 --branch v5.0.8 --no-checkout https://github.com/emqx/emqx.git

GPU course

https://vccvisualization.org/CS380_GPU_and_GPGPU_Programming/

TensorFlow

grappler

Papers

accelerator

Modeling Deep Learning Accelerator Enabled GPUs

transformer优化

ANP transformer优化

MLIR

compile

cmake -G Ninja ../llvm \
 -DLLVM_ENABLE_PROJECTS="mlir;clang" \
 -DLLVM_BUILD_EXAMPLES=OFF \
 -DLLVM_TARGETS_TO_BUILD="host;NVPTX;X86" \
 -DCMAKE_BUILD_TYPE=Release \
 -DLLVM_ENABLE_ASSERTIONS=ON \
 -DMLIR_ENABLE_BINDINGS_PYTHON=ON

CUDA

cutlass

代码解析

/// Policy object describing MmaTensorOp
template <
    /// Warp-level GEMM operator (concept: gemm::warp::Mma)
    typename Operator_,
    /// Padding used for A operand in shared memory (concept: MatrixShape)
    typename SmemPaddingA_,
    /// Padding used for B operand in shared memory (concept: MatrixShape)
    typename SmemPaddingB_,
    /// Number of partitions of K dimension of GEMM
    int PartitionsK = 1>
struct MmaPolicy {
  /// Warp-level GEMM operator (concept: gemm::warp::MmaTensorOp or gemm::warp::MmaSimt)
  using Operator = Operator_;

  /// Padding used for A operand in shared memory
  using SmemPaddingA = SmemPaddingA_;

  /// Padding used for B operand in shared memory
  using SmemPaddingB = SmemPaddingB_;

  /// Number of partitions of K dimension
  static int const kPartitionsK = PartitionsK;
};

/// Structure to compute the matrix product targeting CUDA cores and SIMT math
/// instructions.
template <
    /// Size of the Gemm problem - concept: gemm::GemmShape<>
    typename Shape_,
    /// Policy describing tuning details (concept: MmaPolicy)
    typename Policy_,
    /// Number of stages,
    int Stages,
    /// Used for partial specialization
    typename Enable = bool>
class MmaBase {
 public:
  ///< Size of the Gemm problem - concept: gemm::GemmShape<>
  using Shape = Shape_;

  ///< Policy describing tuning details
  using Policy = Policy_;

  //
  // Dependent types
  //

  /// Warp-level Mma
  using Operator = typename Policy::Operator;

  /// Shape describing the overall GEMM computed from shared memory
  /// by each warp.
  using WarpGemm = typename Policy::Operator::Shape;

  /// Shape describing the number of warps filling the CTA
  using WarpCount = GemmShape<Shape::kM / WarpGemm::kM,
                              Shape::kN / WarpGemm::kN,
                              Shape::kK / WarpGemm::kK>;

  /// Number of warp-level GEMM oeprations
  static int const kWarpGemmIterations =
      (WarpGemm::kK / Operator::Policy::MmaShape::kK);

  /// Number of stages
  static int const kStages = Stages;

  /// Tensor reference to the A operand
  using TensorRefA = TensorRef<typename Operator::ElementA, typename Operator::LayoutA>;

  /// Tensor reference to the B operand
  using TensorRefB = TensorRef<typename Operator::ElementB, typename Operator::LayoutB>;

  static_assert(kWarpGemmIterations > 1,
                "The pipelined structure requires at least two warp-level "
                "GEMM operations.");

  static_assert((kWarpGemmIterations % 2) == 0,
                "Inner loop iteration must be an even number.");

  //
  // Nested structs
  //

  /// Shared storage object needed by threadblock-scoped GEMM
  class SharedStorage {
   public:
    //
    // Type definitions
    //

    /// Shape of the A matrix operand in shared memory
    using ShapeA = MatrixShape<Shape::kM + Policy::SmemPaddingA::kRow,
                               Shape::kK * kStages +
                                   Policy::SmemPaddingA::kColumn>;

    /// Shape of the B matrix operand in shared memory
    using ShapeB =
        MatrixShape<Shape::kK * kStages + Policy::SmemPaddingB::kRow,
                    Shape::kN + Policy::SmemPaddingB::kColumn>;

   public:
    //
    // Data members
    //

    /// Buffer for A operand
    AlignedBuffer<typename Operator::ElementA, ShapeA::kCount> operand_A;

    /// Buffer for B operand
    AlignedBuffer<typename Operator::ElementB, ShapeB::kCount> operand_B;

   public:

    //
    // Methods
    //

    /// Returns a layout object for the A matrix
    CUTLASS_DEVICE
    static typename Operator::LayoutA LayoutA() {
      return Operator::LayoutA::packed({ShapeA::kRow, ShapeA::kColumn});
    }

    /// Returns a layout object for the B matrix
    CUTLASS_HOST_DEVICE
    static typename Operator::LayoutB LayoutB() {
      return Operator::LayoutB::packed({ShapeB::kRow, ShapeB::kColumn});
    }

    /// Returns a TensorRef to the A operand
    CUTLASS_HOST_DEVICE
    TensorRefA operand_A_ref() {
      return TensorRefA{operand_A.data(), LayoutA()};
    }

    /// Returns a TensorRef to the B operand
    CUTLASS_HOST_DEVICE
    TensorRefB operand_B_ref() {
      return TensorRefB{operand_B.data(), LayoutB()};
    }
  };

 protected:

  //
  // Data members
  //

  /// Iterator to load a warp-scoped tile of A operand from shared memory
  typename Operator::IteratorA warp_tile_iterator_A_;

  /// Iterator to load a warp-scoped tile of B operand from shared memory
  typename Operator::IteratorB warp_tile_iterator_B_;

public:

  /// Construct from tensor references
  CUTLASS_DEVICE
  MmaBase(
      ///< Shared storage needed for internal use by threadblock-scoped GEMM
      SharedStorage &shared_storage,
      ///< ID within the threadblock
      int thread_idx,
      ///< ID of warp
      int warp_idx,
      ///< ID of each thread within a warp
      int lane_idx
    ):
      warp_tile_iterator_A_(shared_storage.operand_A_ref(), lane_idx),
      warp_tile_iterator_B_(shared_storage.operand_B_ref(), lane_idx) {

  }
};

cpp book

modern cpp

modern-cpp-tutorial-en
modern-cpp-tutorial-cn

GPU

Turing

Brand Name	GPU Architecture	Tensor Core	NVIDIA CUDA® Cores	TensorFLOPS	Single-Precision	Double-Precision	Mixed-Precision(FP16/FP32)	INT8	INT4	GPU Memory	Interconnect Bandwidth	System Interface
V100 PCle	NVIDIA Volta	640 1nd	5,120	112 TFLOPS	14 TFLOPS	7 TFLOPS	12x TFLOPS			32 GB HBM2 900 GB/sec	32 GB/sec	x16 PCIe Gen3
V100 SXM2	NVIDIA Volta	640 1nd	5,120	125 TFLOPS	15.7 TFLOPS	7.8 TFLOPS				32 GB HBM2 900 GB/sec	300 GB/sec	x6 NVLink 2.0
T4	NVIDIA Turing	320 2nd	2,560		8.1 TFLOPS		65 TFLOPS	130 TOPS	260 TOPS	16 GB GDDR6 300 GB/sec	32 GB/sec	x16 PCIe Gen3

GPU Infos

P100

V100
- ALU
  - 5376 FP32 cores = 6 GPC * 7 TPC * 2 SM * 64 FP32 cores(64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, Four texture units)
  - SM
    - 64 FP32 + INT32 cores, 32 FP64 cores, 8 tensor cores(FP32/FP16 mixed-precision)
    - 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 2 tensor cores, 8 LD/ST units
  - TensorCore: 64 floating point（FP16） FMA / (TensorCore *clock)， 512 per SM per clock. 64 * 640 TensorCore * 2 * 1530 Mhz = 125 TFlops
  - single precision (FP32) floating-point calculations: (5120 FP32 CUDA cores) × (2 flop/core/cycle) × (1.53 Gcycle/s) ≈ 15.7 Tflop/s, The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions(instruction throughput is N/32 instructions per clock cycle).
- Mem
  - 512 bit * 8 memory controllers
  - 6144 KB L2 cache
- IO
  - six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s.

A100
- ALU
  - 8192 FP32 cores = 8 GPC * 8 TPC * 2 SM * 64 FP32 cores(64 INT32 cores, 32 FP64 cores, 4 Tensor Cores, Four texture units)
  - SM
    - 64 FP32 + INT32 cores, 32 FP64 cores, 4 * 3rd tensor cores(FP32/FP16, int8/int4 mixed-precision)
    - 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 1 tensor cores, 8 LD/ST units
  - TensorCore: 256 floating point（FP16） FMA / (TensorCore *clock)， 256 * 4 * 108 * 2 * 1.41 Gcycle/s = 312 TFlops. Sparse performance double.
  - single precision (FP32) floating-point calculations: (8192 FP32 CUDA cores) × (2 flop/core/cycle) × (1.41 Gcycle/s) ≈ 23.1 Tflop/s, The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions(instruction throughput is N/32 instructions per clock cycle).
  - 108(7 * 8 * 2) * 64 * 1410 Mhz * 2 = 19.5 TFlops
- Mem/IO
  - 512 bit * 12 memory controllers (Full A100), 512 * 10 (A100)
  - 5 active HBM2 stacks, HBM2 1215 MHz(10 512-bit memory controllers, not full GA100)，1555 GB/sec = 10 * 512 * 1215 * 2/8
  - 192 KB shared-mem / L1 per SM
  - 40 MB L2 cache
  - NVLink: 50 Gbit/sec * 12 = 600 Gbit/sec

H00
https://www.nextplatform.com/2022/03/31/deep-dive-into-nvidias-hopper-gpu-architecture/
https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf

Comparison of NVIDIA Tesla GPUs

Data Center GPU	NVIDIA Tesla P100	NVIDIA Tesla V100	NVIDIA A100
GPU Codename	GP100	GV100	GA100
GPU Architecture	NVIDIA Pascal	NVIDIA Volta	NVIDIA Ampere
GPU Board Form Factor	SXM	SXM2	SXM4
SMs	56	80	108
TPCs	28	40	54
FP32 Cores / SM	64	64	64
FP32 Cores / GPU	3584	5120	6912
FP64 Cores / SM	32	32	32
FP64 Cores / GPU	1792	2560	3456
INT32 Cores / SM	NA	64	64
INT32 Cores / GPU	NA	5120	6912
Tensor Cores / SM	NA	8	42
Tensor Cores / GPU	NA	640	432
GPU Boost Clock	1480 MHz	1530 MHz	1410 MHz
Peak FP16 Tensor TFLOPS with FP16 Accumulate1	NA	125	312/6243
Peak FP16 Tensor TFLOPS with FP32 Accumulate1	NA	125	312/6243
Peak BF16 Tensor TFLOPS with FP32 Accumulate1	NA	NA	312/6243
Peak TF32 Tensor TFLOPS1	NA	NA	156/3123
Peak FP64 Tensor TFLOPS1	NA	NA	19.5
Peak INT8 Tensor TOPS1	NA	NA	624/12483
Peak INT4 Tensor TOPS1	NA	NA	1248/24963
Peak FP16 TFLOPS1	21.2	31.4	78
Peak BF16 TFLOPS1	NA	NA	39
Peak FP32 TFLOPS1	10.6	15.7	19.5
Peak FP64 TFLOPS1	5.3	7.8	9.7
Peak INT32 TOPS1,4	NA	15.7	19.5
Texture Units	224	320	432
Memory Interface	4096-bit HBM2	4096-bit HBM2	5120-bit HBM2
Memory Size	16 GB	32 GB / 16 GB	40 GB
Memory Data Rate	703 MHz DDR	877.5 MHz DDR	1215 MHz DDR
Memory Bandwidth	720 GB/sec	900 GB/sec	1555 GB/sec
L2 Cache Size	4096 KB	6144 KB	40960 KB
Shared Memory Size / SM	64 KB	Configurable up to 96 KB	Configurable up to 164 KB
Register File Size / SM	256 KB	256 KB	256 KB
Register File Size / GPU	14336 KB	20480 KB	27648 KB
TDP	300 Watts	300 Watts	400 Watts
Transistors	15.3 billion	21.1 billion	54.2 billion
GPU Die Size	610 mm²	815 mm²	826 mm2
TSMC Manufacturing Process	16 nm FinFET+	12 nm FFN	7 nm N7

Paddle Inference

todo

nsight system

nsight systems 和 nsight compute都是基于CUDA Profiling Tools Interface(CUPTI) 构建。

nsys profile --stats=true ./main

CUDA_VISIBLE_DEVICES=3 nsys profile -t cuda,nvtx,cublas,cublas-verbose,cusparse,cusparse-verbose,cudnn --stats=true --cuda-memory-usage true python main.py --model_file=../../../work/test/infer_bench/Models/MobileNetV1/inference.pdmodel --params_file=../../../work/test/infer_bench/Models/MobileNetV1/inference.pdiparams --use_gpu=1 --repeat=2

c++

c++ format

"C_Cpp.clang_format_style": "{BasedOnStyle: Webkit, BreakBeforeBraces: Attach, IndentWidth: 4, BinPackParameters: false, NamespaceIndentation: None, BreakConstructorInitializers: AfterColon, ContinuationIndentWidth: 8, ConstructorInitializerIndentWidth: 8, ColumnLimit: 120, AlwaysBreakTemplateDeclarations: Yes, AllowShortFunctionsOnASingleLine: None}"

zhangjun / zhangjun.github.io Goto Github PK

zhangjun.github.io's Introduction

zhangjun.github.io

Introduction

content

zhangjun.github.io's People

Contributors

Stargazers

Watchers

zhangjun.github.io's Issues

归一化方法

Batch Norm

Layer Norm

Instance Norm

Group Norm

CPU

GPU

Metal for Paddle Lite

C++ 11

常用C++11特性

1、移动语义

2、右值引用

3、智能指针

4、初始化列表

5、静态断言

6、noexcept

7、push_back和emplace_back

8、lambda

todo

ONNXRuntime TRT

TO ADD

optimizer

IR

optimizer

elination

fusion

向量检索

算法选型

repo链接

LLM量化

fine-tune

LLM papers

PaddleSlim量化

静态离线量化（Post Training Quantization Static, PTQ Static）

github ssh配置

生成key

配置～/.ssh/config文件

git 常用命令

grappler

accelerator

compile

代码解析

modern cpp

Turing

GPU Infos

Comparison of NVIDIA Tesla GPUs

todo

nsight system

c++ format

feature store

Recommend Projects

Recommend Topics

Recommend Org