xymfei / stable-fast

This project forked from chengzeyi/stable-fast

Fast Stable Diffusion: an ultra lightweight inference performance optimization library for HuggingFace Diffusers on NVIDIA GPUs.

License: MIT License

Languages: C++ 19.38%, Python 53.17%, Cuda 27.45%

Stable Fast

Introduction

NOTE: stable-fast is still in beta and may be buggy. Feel free to try it out and give suggestions!

What is this?

stable-fast is an ultra lightweight inference optimization library for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super fast inference optimization by utilizing some key techniques and features:

  • CUDNN Convolution Fusion: stable-fast implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of Conv + Bias + Add + Act computation patterns.
  • Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16 precision, which is faster than PyTorch's default (read and write in fp16, compute in fp32).
  • NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + GELU operator with OpenAI's Triton, which eliminates the need for memory format permutation operators.
  • Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it better suited to tracing complex models. Nearly every part of StableDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has significantly lower CPU overhead, and supports ControlNet and LoRA.
  • CUDA Graph: stable-fast can capture the UNet into CUDA Graph format, which reduces CPU overhead when the batch size is small.
  • Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
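The difference between "compute in fp16" and "read and write in fp16, compute in fp32" can be seen in a toy, pure-Python emulation of rounding. This is only a sketch under stated assumptions: the helper names `to_fp16`, `to_fp32` and `dot` are hypothetical, the accumulation is strictly sequential, and the inputs are chosen to exaggerate the effect. Real GEMM kernels accumulate in tiles, so this says nothing about the accuracy of stable-fast's operators; it only shows which step the accumulator precision affects.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def dot(a, b, round_acc):
    """Dot product of fp16 inputs; round_acc sets the compute precision."""
    acc = 0.0
    for x, y in zip(a, b):
        # Both the product and the running sum are rounded after each step.
        acc = round_acc(acc + round_acc(x * y))
    return to_fp16(acc)  # the result is written back as fp16 either way

n = 4096
a = [to_fp16(0.1)] * n
b = [to_fp16(0.1)] * n

exact = sum(x * y for x, y in zip(a, b))  # double-precision reference
fp32_compute = dot(a, b, to_fp32)  # read/write fp16, compute in fp32
fp16_compute = dot(a, b, to_fp16)  # read/write fp16, compute in fp16

print(f'exact        : {exact:.4f}')
print(f'fp32 compute : {fp32_compute:.4f}')
print(f'fp16 compute : {fp16_compute:.4f}')
```

In this contrived case the fp16 accumulator stops growing once the running sum is large compared with each addend, while the fp32 accumulator stays close to the reference. The fp16 path is the faster one on hardware, which is the trade the bullet above describes.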

Differences With Other Acceleration Libraries

  • Fast: stable-fast is specially optimized for HuggingFace Diffusers. In the benchmarks below it is the fastest framework on the A100 and within 1 it/s of the fastest on the RTX 3090 Ti.
  • Minimal: stable-fast works as a plugin framework for PyTorch. It uses existing PyTorch functionality and infrastructure, and is compatible with other acceleration techniques as well as popular fine-tuning techniques and deployment solutions.

Performance Comparison

A100 SXM 80GB (SD v1.5, 512x512, fp16)

Framework                               Performance
Vanilla PyTorch                         23 it/s
AITemplate                              44 it/s
TensorRT                                52 it/s
OneFlow                                 55 it/s
Stable Fast (with xformers & triton)    60 it/s

RTX 3090 Ti (SD v1.5, 512x512, fp16)

Framework                               Performance
Vanilla PyTorch                         16 it/s
AITemplate                              31 it/s
TensorRT                                33 it/s
OneFlow                                 39 it/s
Stable Fast (with xformers & triton)    38 it/s
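To read the two tables as speedups over vanilla PyTorch, the figures can be divided by the baseline. The short sketch below does only that arithmetic; the numbers are copied from the tables above, and the `speedups` helper is a hypothetical name used for illustration.

```python
# Throughput in iterations per second, copied from the tables above.
benchmarks = {
    'A100 SXM 80GB': {
        'Vanilla PyTorch': 23,
        'AITemplate': 44,
        'TensorRT': 52,
        'OneFlow': 55,
        'Stable Fast': 60,
    },
    'RTX 3090 Ti': {
        'Vanilla PyTorch': 16,
        'AITemplate': 31,
        'TensorRT': 33,
        'OneFlow': 39,
        'Stable Fast': 38,
    },
}

def speedups(results, baseline='Vanilla PyTorch'):
    """Return each framework's throughput relative to the baseline."""
    base = results[baseline]
    return {name: its / base for name, its in results.items()}

for gpu, results in benchmarks.items():
    print(gpu)
    for name, ratio in speedups(results).items():
        print(f'  {name:<16} {results[name]:>3} it/s  {ratio:.2f}x')
```

On these figures Stable Fast is roughly 2.6x the vanilla PyTorch throughput on the A100 and roughly 2.4x on the RTX 3090 Ti.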

Usage

Installation

NOTE: stable-fast is currently only tested on Linux. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).
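A quick way to check an installed PyTorch against the suggested range is to compare its major and minor version numbers. The following is a minimal sketch: `major_minor` and `is_suggested_torch` are hypothetical helpers, not part of stable-fast, and the bounds simply restate the range in the note above.

```python
import re

SUGGESTED_MIN = (1, 12)  # lower bound from the note above
SUGGESTED_MAX = (2, 1)   # upper bound from the note above

def major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.1.0+cu118'."""
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f'unrecognized version string: {version!r}')
    return int(match.group(1)), int(match.group(2))

def is_suggested_torch(version: str) -> bool:
    """True if the version falls inside the suggested 1.12 to 2.1 range."""
    return SUGGESTED_MIN <= major_minor(version) <= SUGGESTED_MAX

for v in ('1.11.0', '1.12.1+cu113', '2.1.0+cu118', '2.2.0'):
    print(v, is_suggested_torch(v))
```

With PyTorch installed, the same check would be `is_suggested_torch(torch.__version__)`. Versions outside the range may still work; the note only says which ones are suggested.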

Install From Source

# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and the other packages first
pip install torch diffusers xformers 'triton>=2.1.0'

# (Optional) Makes the build much faster
pip install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
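When the build machine and the target GPU differ, TORCH_CUDA_ARCH_LIST takes a list of compute capabilities such as `8.0;8.6`. The sketch below shows one way to assemble that value; `arch_list` is a hypothetical helper, and the two capabilities are those of the GPUs in the benchmark section (A100 is 8.0, RTX 3090 Ti is 8.6).

```python
import os

def arch_list(capabilities):
    """Format (major, minor) compute capabilities for TORCH_CUDA_ARCH_LIST."""
    unique = sorted(set(capabilities))
    return ';'.join(f'{major}.{minor}' for major, minor in unique)

# Compute capabilities: A100 is 8.0, RTX 3090 Ti is 8.6.
targets = [(8, 0), (8, 6)]

# Copy of the current environment with the variable added for the build.
build_env = dict(os.environ, TORCH_CUDA_ARCH_LIST=arch_list(targets))
print(build_env['TORCH_CUDA_ARCH_LIST'])  # 8.0;8.6
```

`build_env` could then be passed as `env=` to `subprocess.run` when invoking the `pip install` command above. On a machine that has the target GPU, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for it.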

NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.

NOTE: To get the best performance, xformers and OpenAI's Triton (triton>=2.1.0) need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.

Some Common Methods To Speed Up PyTorch

# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

import packaging.version
import torch

# Since PyTorch 1.12, TF32 matmuls are disabled by default; re-enable them.
if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True
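LD_PRELOAD has to be set before the Python process starts, so it cannot be switched on from inside a script that is already running. If the launch is itself driven from Python, one option is to build the environment for a child process. This is a minimal sketch; `with_preload` is a hypothetical helper, and `/path/to/libtcmalloc.so` is the same placeholder as in the shell line above.

```python
import os

def with_preload(lib_path, env=None):
    """Return a copy of env with lib_path placed first in LD_PRELOAD."""
    new_env = dict(os.environ if env is None else env)
    existing = new_env.get('LD_PRELOAD', '')
    # Keep any library that was already preloaded, after the new one.
    new_env['LD_PRELOAD'] = f'{lib_path}:{existing}' if existing else lib_path
    return new_env

# A small explicit environment keeps the example deterministic.
env = with_preload('/path/to/libtcmalloc.so', env={'PATH': '/usr/bin'})
print(env['LD_PRELOAD'])
```

The returned dictionary can be passed as `env=` to `subprocess.run([sys.executable, 'your_script.py'], ...)`, where `your_script.py` stands for whatever script runs the pipeline.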

Optimize StableDiffusionPipeline

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    CompilationConfig,
    compile,
)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving the best performance.
# Triton may take a while to generate, compile and tune its kernels.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('triton not installed, skip')
# CUDA Graph is suggested for small batch sizes.
# After capturing, the model only accepts one fixed image size.
# If you want the model to be dynamic, don't enable it.
config.enable_cuda_graph = True
compiled_model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, lineart, monochrome, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=50,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The first call will trigger compilation and might be very slow.
# After the first call, it should be very fast.
output_image = compiled_model(**kwarg_inputs).images[0]

# Let's see the second call!
output_image = compiled_model(**kwarg_inputs).images[0]
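To put a number on the second call, the whole pipeline invocation can be timed and converted to iterations per second. The sketch below is one way to do that and makes several assumptions: `iterations_per_second` is a hypothetical helper, the stand-in workload replaces the real pipeline so the example runs without a GPU, and timing the full call includes text encoding and image decoding, so the result is not directly comparable to the tables above (the README does not state how those were measured).

```python
import time

def iterations_per_second(fn, steps, repeats=3):
    """Call fn() `repeats` times and return steps divided by the best time."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return steps / best

# Stand-in workload so the sketch runs without a GPU.
def fake_pipeline():
    time.sleep(0.01)

print(f'{iterations_per_second(fake_pipeline, steps=50):.1f} it/s')
```

With the compiled model from the example above, and after the warm-up call, this would be used as `iterations_per_second(lambda: compiled_model(**kwarg_inputs), kwarg_inputs['num_inference_steps'])`.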

Contributors

chengzeyi
