
fastertransformer_backend's Introduction

Note: FasterTransformer development has transitioned to TensorRT-LLM. All developers are encouraged to leverage TensorRT-LLM and tensorrtllm_backend to get the latest improvements on LLM Inference. The NVIDIA/FasterTransformer repo will stay up, but will not have further development.

FasterTransformer Backend

The Triton backend for FasterTransformer. This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components; it is tested and maintained by NVIDIA. Since FasterTransformer v4.0, it supports multi-GPU inference on GPT-3 models. This backend integrates FasterTransformer into Triton so that giant GPT-3 models can be served by Triton. In the example below, we show how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained with Megatron-LM. The latest release also supports multi-node, multi-GPU inference on T5 with Hugging Face models.

Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this fastertransformer_backend repo.

Table Of Contents

Support matrix

Models     FP16   BF16   Tensor parallel   Pipeline parallel
GPT/OPT    Yes    Yes    Yes               Yes
BLOOM      Yes    Yes    Yes               Yes
GPT-J      Yes    Yes    Yes               Yes
T5/UL2     Yes    Yes    Yes               Yes
GPT-NeoX   Yes    Yes    Yes               Yes
BERT       Yes    Yes    Yes               Yes

Introduction

The FasterTransformer backend integrates FasterTransformer into Triton, leveraging the efficiency of FasterTransformer and the serving capabilities of Triton. To run a GPT-3 model, we need to solve the following two issues: 1. How to run an auto-regressive model? 2. How to run the model with multiple GPUs and multiple nodes?

For the first issue, the workflow of an auto-regressive model looks like this:

  1. FasterTransformer backend receives input [A].
  2. Compute the query (q), key (k) and value (v) from the input [A].
  3. Compute attention: qk = q * k and qkv = qk * v.
  4. Compute the other transformer operations, like the feed-forward network.
  5. Generate the next token B and return it to the Triton server.
  6. FasterTransformer backend receives inputs [A, B].
  7. Compute the query (q') from [B], and the keys (k') and values (v') from the inputs [A, B].
  8. Compute attention: qk' = q' * k' and qkv' = qk' * v'.
  9. Compute the other transformer operations, like the feed-forward network.
  10. Generate the next token C and return it to the Triton server.

We can see that the attention at every step must be computed from the current query and all previous keys and values. Some of this computation is wasted: for example, the key of A is recomputed at step 6 even though we already have the same result from step 2. To avoid this redundant computation, we need a mechanism to store these states. Triton does not currently support such a feature, so FasterTransformer handles the whole generation workflow itself, storing the key and value states and returning only the final result. The workflow in FasterTransformer is as follows (a minimal sketch of this key/value-cache loop follows the list below):

  1. Allocate cache buffers K and V for the keys and values respectively.
  2. FasterTransformer backend receives input [A].
  3. Compute the query (q), key (k) and value (v) from the input [A]. Put k into K[0] and v into V[0].
  4. Compute attention: qk = q * K[:1] and qkv = qk * V[:1].
  5. Compute the other transformer operations, like the feed-forward network.
  6. Generate the next token B and set [B] as the input.
  7. Compute the query (q'), key (k') and value (v') from the input [B]. Put k' into K[1] and v' into V[1].
  8. Compute attention: qk' = q' * K[:2] and qkv' = qk' * V[:2].
  9. Compute the other transformer operations, like the feed-forward network.
  10. Generate the next token C and set [C] as the input.
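
The following is a minimal, runnable NumPy sketch of this key/value-cache loop for a single toy layer. The random "weights", the greedy argmax step, and all names here are illustrative placeholders only, not the actual FasterTransformer implementation (which is fused CUDA code operating on batches, heads and real model weights).

import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 16
E = rng.standard_normal((VOCAB, D)).astype(np.float32)           # toy embedding table
Wq, Wk, Wv = (rng.standard_normal((D, D)).astype(np.float32) for _ in range(3))
Wo = rng.standard_normal((D, VOCAB)).astype(np.float32)          # toy output projection

def generate_with_kv_cache(prompt_ids, max_new_tokens, max_len=128):
    # Step 1: pre-allocate cache buffers K and V.
    K = np.zeros((max_len, D), dtype=np.float32)
    V = np.zeros((max_len, D), dtype=np.float32)
    tokens, pos = list(prompt_ids), 0
    inputs = list(prompt_ids)                  # the first pass consumes the whole prompt
    for _ in range(max_new_tokens):
        for tok in inputs:                     # steps 3/7: compute q, k, v and cache k, v
            x = E[tok]
            q, K[pos], V[pos] = x @ Wq, x @ Wk, x @ Wv
            pos += 1
        # Steps 4/8: attention over the current query and all cached keys/values.
        scores = K[:pos] @ q / np.sqrt(D)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        context = probs @ V[:pos]
        next_tok = int(np.argmax(context @ Wo))    # stand-in for the FFN and sampling
        tokens.append(next_tok)
        inputs = [next_tok]                    # later passes only feed the newly generated token
    return tokens

print(generate_with_kv_cache([1, 2, 3], max_new_tokens=5))
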

For the issue of running the model with multiple GPUs and multiple nodes, the FasterTransformer backend uses MPI to communicate between nodes and multi-threading to control the GPUs within one node. Figure 1 demonstrates the multi-GPU, multi-node workflow in the FasterTransformer backend.

Fig. 1 Workflow of multi-gpu and multi-node in FasterTransformer backend.

Setup

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=23.04
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

Prepare docker images

The current official Triton Inference Server Docker image doesn't contain the FasterTransformer backend, so users must prepare their own Docker image, either by:

  1. Using the build script. Note that --is-multistage-build is optional; it installs only the minimal dependencies that allow fastertransformer_backend to run.
# Create your own Triton container. You can skip this step (done in tritonserver/server)
python3 compose.py --backend pytorch --container-version 23.04 --output-name tritonserver_pytorch_only
# In tritonserver/fastertransformer_backend. This will overwrite the current Dockerfile
python3 docker/create_dockerfile_and_build.py --base-image tritonserver_pytorch_only --image-name tritonserver_with_ft --is-multistage-build

Alternatively you can simply run

python3 create_dockerfile_and_build.py --triton-version 23.04

to generate a FasterTransformer backend image, as done in option 2.

  2. Using the command below:
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .

For testing purposes, the Docker image will also contain a set of tools for model deployment testing.

Push the built Docker image to a Docker registry so that it can later be pulled and started on multiple nodes.

docker tag ${TRITON_DOCKER_IMAGE} <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
docker push <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}

Rebuilding FasterTransformer backend (optional)

Every time you need an updated fastertransformer_backend, you can rebuild the Docker image.

You can also build it manually in an interactive session (e.g., while fixing code on the target node) with:

docker run -it \
    --shm-size=1g --ulimit memlock=-1 \
    -v ${WORKSPACE}:/workspace \
    --name ft_backend_builder \
    ${TRITON_DOCKER_IMAGE} bash
# in docker container
rm /opt/tritonserver/lib/cmake/FasterTransformer/ -rf # Remove original library
cd fastertransformer_backend
mkdir build -p && cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

where ${WORKSPACE} should contain fastertransformer_backend directory with code to build.

Then you can commit changes to new docker image with:

docker commit ft_backend_builder ${TRITON_DOCKER_IMAGE}

NCCL_LAUNCH_MODE

In the Dockerfile, NCCL_LAUNCH_MODE=GROUP is the default because it is less likely to hang. However, NCCL_LAUNCH_MODE=PARALLEL can bring better communication performance, so users can try NCCL_LAUNCH_MODE=PARALLEL to accelerate.

In the current environment:

export NCCL_LAUNCH_MODE=PARALLEL

Or when building the Docker container, by changing the Dockerfile:

ENV NCCL_LAUNCH_MODE=PARALLEL

Or by passing the environment variable when starting the container:

docker run -e NCCL_LAUNCH_MODE=PARALLEL ...

GPU Topology

If your current machine/nodes are fully connected through PCIe or even across NUMA nodes, there could be poor NCCL performance or even NCCL hangs due to limited peer-to-peer communication. You can run nvidia-smi topo -m to check the topology.

If you encounter timeouts or hangs, please first check the topology and try to use machines with NVLink-connected GPUs, such as DGX V100 or DGX A100.

Model-Parallelism and Triton-Multiple-Model-Instances

We apply MPI to start single-node/multi-node servers.

  • N: Number of MPI Processes/Number of Nodes
  • T: Tensor Parallel Size. Default 1
  • P: Pipeline Parallel Size. Default 1

Multiple model instances on same GPUs will share the weights, so there will not be any redundant weights memory allocated.

Run inter-node (T x P > GPUs per Node) models

  • total number of GPUs = num_gpus_per_node x N = T x P.
  • only a single model instance is supported

Run intra-node (T x P <= GPUs per Node) models

  • the total number of visible GPUs must be evenly divisible by T x P. Note that you can control this by setting CUDA_VISIBLE_DEVICES.

  • the total number of visible GPUs must be <= T x P x instance count. This avoids unnecessary CUDA memory allocation on unused GPUs.

  • multiple model instances can run on the same GPU groups or on different GPU groups.

    The backend will first try to assign different GPU groups to different model instances. If there are no free GPU groups left, multiple model instances will be assigned to the same GPU groups.

    For example, if there are 8 GPUs and 8 model instances (T = 2, P = 1), then the model instances will be distributed to GPU groups [0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3], [4, 5], [6, 7]. A minimal sketch of this assignment policy follows the list below.

  • weights are shared among model instances in the same GPU group. In the example above, instance 0 and instance 4 will share the same weights, and the others are similar.
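
The following is a minimal Python sketch of this round-robin assignment policy, written only to illustrate the behavior described above; the real logic lives inside the backend's C++ code, and the function name here is hypothetical.

def assign_gpu_groups(num_visible_gpus, tensor_para, pipeline_para, instance_count):
    """Assign model instances to fixed-size GPU groups in round-robin order (illustrative only)."""
    group_size = tensor_para * pipeline_para
    assert num_visible_gpus % group_size == 0, "visible GPUs must be evenly divisible by T x P"
    groups = [list(range(start, start + group_size))
              for start in range(0, num_visible_gpus, group_size)]
    # Instances cycle through the groups; instances that land on the same group share weights.
    return [groups[i % len(groups)] for i in range(instance_count)]

# 8 GPUs, 8 instances, T = 2, P = 1  ->  [0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3], [4, 5], [6, 7]
print(assign_gpu_groups(8, 2, 1, 8))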

Specify Multiple Model Instances

Set count here to start multiple model instances. Note that KIND_CPU is the only valid choice here, because the backend needs to take full control of how the model instances are distributed across all the visible GPUs.

instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]

Multi-Node Inference

We currently do not support the case where different nodes have different numbers of GPUs.

We start one MPI process per node. If you need to run on three nodes, launch three MPI processes, one per node. Remember to change tensor_para_size and pipeline_para_size if you run on multiple nodes.

We suggest setting tensor_para_size = number of GPUs in one node (e.g. 8 for DGX A100) and pipeline_para_size = number of nodes (e.g. 2 for two nodes). Other model configuration in config.pbtxt should be modified as usual.
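
For example, a config.pbtxt fragment for two nodes with 8 GPUs each might set the parallelism parameters as below; the values are illustrative only and must match the parallel sizes used when converting the checkpoint.

parameters {
  key: "tensor_para_size"
  value: {
    string_value: "8"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "2"
  }
}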

Request examples

The tools directory provides Python scripts to send requests to the Triton server. You can build upon those examples to suit your needs.

Specifically, tools/issue_request.py is a simple script that sends a request contained in a JSON file. You may use it with python3 tools/issue_request.py tools/requests/sample_request.json, for example. You can also pass command-line arguments as a JSON-formatted string with the --params argument.
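
If you prefer to build requests programmatically instead of from a JSON file, the following is a minimal sketch using the tritonclient Python package. It assumes a GPT-style model served under the name fastertransformer with the tensor names shown in the model configuration later on this page (input_ids, input_lengths, request_output_len, output_ids); the token IDs, names, shapes, and server address are placeholders to adjust for your deployment.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# One request in the batch; in practice the token IDs come from a tokenizer or a preprocessing model.
input_ids = np.array([[818, 262, 3726]], dtype=np.uint32)
request_inputs = {
    "input_ids": input_ids,
    "input_lengths": np.array([[input_ids.shape[1]]], dtype=np.uint32),
    "request_output_len": np.array([[32]], dtype=np.uint32),
}

infer_inputs = []
for name, array in request_inputs.items():
    tensor = httpclient.InferInput(name, list(array.shape), "UINT32")
    tensor.set_data_from_numpy(array)
    infer_inputs.append(tensor)

result = client.infer("fastertransformer", infer_inputs,
                      outputs=[httpclient.InferRequestedOutput("output_ids")])
print(result.as_numpy("output_ids"))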

Changelog

Oct 2022

  • Support IA3 in T5 and T5-Encoder

Sep 2022

  • Support T5-Encoder only backend
  • Support T5 prompt tuning and p tuning
  • Support factual-nucleus sampling (link)

Aug 2022

  • Release the FasterTransformer backend 1.2.
  • Support for interactive generation

July 2022

  • Support shared context optimization in GPT model
  • Support UL2

June 2022

  • Support decoupled (streaming) mode.
  • Add demo of grpc protocol.
  • Support BERT

May 2022

  • Support GPT-NeoX.
  • Support optional input. (triton version must be after 22.05)

April 2022

  • Release the FasterTransformer backend 1.1.
  • Support bfloat16 inference in GPT model.
  • Support Nemo Megatron T5 and Megatron-LM T5 model.
  • Support optional input in fastertransformer backends. (Only supported after Triton 22.01)

Jan 2022

  • Move runtime argument like topk into backend input.

Nov 2021

  • Release FasterTransformer backend 1.1 beta version 2.
    • Support Multi-node Multi-GPU T5.
    • Support INT8 weight only quantization (Experimental).

Sep 2021

  • Release FasterTransformer backend 1.1 beta version 1.
    • Support Multi-node on GPT.

Apr 2021

  • Release the FasterTransformer backend 1.0.
    • Support Multi-GPU on GPT.


fastertransformer_backend's Issues

Does FasterTransformer support version nvcr.io/nvidia/tritonserver:21.07-py3?

Description

Branch: https://github.com/triton-inference-server/fastertransformer_backend.git (main)
Docker version: nvcr.io/nvidia/tritonserver:21.07-py3
GPU: Tesla V100

When I follow the steps to build the FasterTransformer backend, an error happens:

45 |   BackendModel(TRITONBACKEND_Model* triton_model);
/workspace/build/_deps/repo-backend-src/include/triton/backend/backend_model.h:45:3: note: candidate expects 1 argument, 2 provided
[100%] Linking CXX executable ../../../../../bin/multi_gpu_gpt_triton_example
[100%] Built target multi_gpu_gpt_triton_example
make[2]: *** [CMakeFiles/triton-fastertransformer-backend.dir/build.make:82: CMakeFiles/triton-fastertransformer-backend.dir/src/libfastertransformer.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1457: CMakeFiles/triton-fastertransformer-backend.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/workspace/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h: In function 'bool test_context_sharing(const string&, const string&) [with T = float]'
/workspace/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'pipeline_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
72 |   NcclParam(NcclParam const& param):
/workspace/build/_deps/repo-ft-src/src/fastertransformer/utils/nccl_utils.h:72:144: warning: 'tensor_para.fastertransformer::NcclParam::nccl_uid_' may be used uninitialized in this function [-Wmaybe-uninitialized]
72 |   NcclParam(NcclParam const& param):
[100%] Linking CXX executable ../../../../bin/test_context_decoder_layer
[100%] Built target test_context_decoder_layer
[100%] Linking CXX executable ../../..../bin/test_sampling
[100%] Built target test_sampling
make: *** [Makefile:149: all] Error 2

Reproduced Steps

Reproduced steps:
1. 
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=21.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

2. docker run -it \
    -v ${WORKSPACE}:/workspace \
    --name ft_backend_builder \
    nvcr.io/nvidia/tritonserver:21.07-py3  bash
3. rm /opt/tritonserver/lib/cmake/FasterTransformer/ -rf # Remove original library
cd fastertransformer_backend

4. apt-key del 7fa2af80 && \
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub && \
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/7fa2af80.pub && \
    apt-get update && \
    apt-get install -y --no-install-recommends openssh-server zsh tmux mosh locales-all clangd sudo \
        zip unzip wget build-essential autoconf autogen gdb python3.8 python3-pip python3-dev rapidjson-dev \
        xz-utils zstd libz-dev && \
    pip3 install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html  && \
    pip3 install --extra-index-url https://pypi.ngc.nvidia.com regex fire tritonclient[all] && \
    pip3 install transformers huggingface_hub tokenizers SentencePiece sacrebleu && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*


export CMAKE_VERSION=3.18 && \
export CMAKE_BUILD=3.18.4 && \
    wget -nv https://cmake.org/files/v${CMAKE_VERSION}/cmake-${CMAKE_BUILD}.tar.gz && \
    tar -xf cmake-${CMAKE_BUILD}.tar.gz && \
    cd cmake-${CMAKE_BUILD} && \
    ./bootstrap --parallel=$(grep -c ^processor /proc/cpuinfo) -- -DCMAKE_USE_OPENSSL=OFF && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
    cd /workspace/build/ && \
    rm -rf /workspace/build/cmake-${CMAKE_BUILD}
5. mkdir build -p && cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D 
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

Crash GPT-J if 'output0_len' is greater than 240.

Description

NVIDIA-SMI 510.85.02
Driver Version: 510.85.02
CUDA Version: 11.7 
Tesla V100-SXM2-32GB
triton_with_ft:22.07

This crash does not happen without Triton; it is fine to run it on FasterTransformer only.

I0905 11:41:20.868666 709 libfastertransformer.cc:1090] Start to forward
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:96

Signal (6) received.
 0# 0x000055BB7569EC19 in /opt/tritonserver/bin/tritonserver
 1# 0x00007F90714FA090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F90718B3911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F90718BF38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F90718BF3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F90718BF6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007F8FD4FEC37D in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# fastertransformer::invokeLengthCriterion(bool*, bool*, int*, unsigned int const*, int, int, int, CUstream_st*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# fastertransformer::DynamicDecodeLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# fastertransformer::GptJ<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GptJWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
12# GptJTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
13# 0x00007F90681B0C5A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
14# 0x00007F90718EBDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
15# 0x00007F9072B00609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

root@0b59991a1d0b:/workspace# I0905 11:40:12.707663 709 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9026000000' with size 268435456
I0905 11:40:12.708842 709 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0905 11:40:12.719048 709 model_repository_manager.cc:1206] loading: fastertransformer:1
I0905 11:40:12.719101 709 model_repository_manager.cc:1206] loading: postprocessing:1
I0905 11:40:12.719138 709 model_repository_manager.cc:1206] loading: preprocessing:1
I0905 11:40:12.857165 709 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I0905 11:40:12.857195 709 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I0905 11:40:12.857203 709 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I0905 11:40:12.859793 709 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
W0905 11:40:12.861707 709 libfastertransformer.cc:160] model configuration:
{
    "name": "fastertransformer",
    "platform": "",
    "backend": "fastertransformer",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "input_ids",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "start_id",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "end_id",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "input_lengths",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "request_output_len",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "runtime_top_k",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "runtime_top_p",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "beam_search_diversity_rate",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "temperature",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "len_penalty",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "repetition_penalty",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "random_seed",
            "data_type": "TYPE_UINT64",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "is_return_log_probs",
            "data_type": "TYPE_BOOL",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "beam_width",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "bad_words_list",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                2,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "stop_words_list",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                2,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "prompt_learning_task_name_ids",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        }
    ],
    "output": [
        {
            "name": "output_ids",
            "data_type": "TYPE_UINT32",
            "dims": [
                -1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "sequence_length",
            "data_type": "TYPE_UINT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "cum_log_probs",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "output_log_probs",
            "data_type": "TYPE_FP32",
            "dims": [
                -1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "fastertransformer_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "gpt-j-6b",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "data_type": {
            "string_value": "fp16"
        },
        "enable_custom_all_reduce": {
            "string_value": "0"
        },
        "model_type": {
            "string_value": "GPT-J"
        },
        "model_checkpoint_path": {
            "string_value": "/workspace/assets/fp32/1-gpu"
        },
        "pipeline_para_size": {
            "string_value": "1"
        },
        "tensor_para_size": {
            "string_value": "1"
        }
    },
    "model_warmup": [],
    "model_transaction_policy": {
        "decoupled": false
    }
}
I0905 11:40:12.861737 709 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I0905 11:40:12.861749 709 libfastertransformer.cc:248] Sequence Batching: disabled
I0905 11:40:12.861861 709 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 31.33 GB, total: 31.75 GB, used:  0.42 GB
I0905 11:40:43.625316 709 libfastertransformer.cc:430] After Loading Weights:
after allocation    : free: 20.05 GB, total: 31.75 GB, used: 11.70 GB
W0905 11:40:43.625476 709 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
W0905 11:40:48.552336 709 libfastertransformer.cc:651] Model name gpt-j-6b
W0905 11:40:48.552391 709 libfastertransformer.cc:661] Use COUPLED (classic) API.
W0905 11:40:48.552409 709 libfastertransformer.cc:756] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W0905 11:40:48.552419 709 libfastertransformer.cc:756] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552426 709 libfastertransformer.cc:756] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552433 709 libfastertransformer.cc:756] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552439 709 libfastertransformer.cc:756] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W0905 11:40:48.552448 709 libfastertransformer.cc:756] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552454 709 libfastertransformer.cc:756] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W0905 11:40:48.552461 709 libfastertransformer.cc:756] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W0905 11:40:48.552467 709 libfastertransformer.cc:756] Get input name: temperature, type: TYPE_FP32, shape: [1]
W0905 11:40:48.552473 709 libfastertransformer.cc:756] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W0905 11:40:48.552480 709 libfastertransformer.cc:756] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W0905 11:40:48.552486 709 libfastertransformer.cc:756] Get input name: random_seed, type: TYPE_UINT64, shape: [1]
W0905 11:40:48.552492 709 libfastertransformer.cc:756] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W0905 11:40:48.552498 709 libfastertransformer.cc:756] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552505 709 libfastertransformer.cc:756] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W0905 11:40:48.552512 709 libfastertransformer.cc:756] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W0905 11:40:48.552518 709 libfastertransformer.cc:756] Get input name: prompt_learning_task_name_ids, type: TYPE_UINT32, shape: [1]
W0905 11:40:48.552528 709 libfastertransformer.cc:798] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W0905 11:40:48.552534 709 libfastertransformer.cc:798] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W0905 11:40:48.552541 709 libfastertransformer.cc:798] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W0905 11:40:48.552548 709 libfastertransformer.cc:798] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
I0905 11:40:48.552762 709 libfastertransformer.cc:451] Before Loading Model:
after allocation    : free: 20.05 GB, total: 31.75 GB, used: 11.70 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0905 11:40:49.304929 709 libfastertransformer.cc:465] After Loading Model:
after allocation    : free: 19.83 GB, total: 31.75 GB, used: 11.92 GB
I0905 11:40:49.305087 709 libfastertransformer.cc:712] Model instance is created on GPU [ 0 ]
I0905 11:40:49.305112 709 libfastertransformer.cc:1590] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (count 1) (instance_id 0)
I0905 11:40:49.305156 709 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0 (CPU device 0)
I0905 11:40:49.305340 709 model_repository_manager.cc:1352] successfully loaded 'fastertransformer' version 1
Downloading tokenizer_config.json: 100%|##########| 252/252 [00:00<00:00, 313kB/s]
Downloading tokenizer.json: 100%|##########| 2.40M/2.40M [00:05<00:00, 435kB/s]
Downloading special_tokens_map.json: 100%|##########| 88.0/88.0 [00:00<00:00, 116kB/s]I0905 11:41:03.937539 709 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0 (CPU device 0)
I0905 11:41:03.937692 709 model_repository_manager.cc:1352] successfully loaded 'postprocessing' version 1
I0905 11:41:11.131309 709 model_repository_manager.cc:1352] successfully loaded 'preprocessing' version 1
I0905 11:41:11.131906 709 model_repository_manager.cc:1206] loading: ensemble:1
I0905 11:41:11.132295 709 model_repository_manager.cc:1352] successfully loaded 'ensemble' version 1
I0905 11:41:11.132375 709 server.cc:559]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0905 11:41:11.132441 709 server.cc:586]
+-------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
| Backend           | Path                                                            | Config                                                          |
+-------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertr | {"cmdline":{"auto-complete-config":"true","min-compute-capabili |
|                   | ansformer.so                                                    | ty":"6.000000","backend-directory":"/opt/tritonserver/backends" |
|                   |                                                                 | ,"default-max-batch-size":"4"}}                                 |
|                   |                                                                 |                                                                 |
| python            | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","min-compute-capabili |
|                   |                                                                 | ty":"6.000000","backend-directory":"/opt/tritonserver/backends" |
|                   |                                                                 | ,"default-max-batch-size":"4"}}                                 |
+-------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+

I0905 11:41:11.132495 709 server.cc:629]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| ensemble          | 1       | READY  |
| fastertransformer | 1       | READY  |
| postprocessing    | 1       | READY  |
| preprocessing     | 1       | READY  |
+-------------------+---------+--------+

I0905 11:41:11.290519 709 metrics.cc:650] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I0905 11:41:11.292935 709 tritonserver.cc:2176]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                              |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                             |
| server_version                   | 2.24.0                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration s |
|                                  | ystem_shared_memory cuda_shared_memory binary_tensor_data statistics trace                                         |
| model_repository_path[0]         | /workspace/all_models/gptj/                                                                                        |
| model_control_mode               | MODE_NONE                                                                                                          |
| strict_model_config              | 0                                                                                                                  |
| rate_limit                       | OFF                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                           |
| response_cache_byte_size         | 0                                                                                                                  |
| min_supported_compute_capability | 6.0                                                                                                                |
| strict_readiness                 | 1                                                                                                                  |
| exit_timeout                     | 30                                                                                                                 |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------+

I0905 11:41:11.294244 709 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0905 11:41:11.294613 709 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0905 11:41:11.336941 709 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Reproduced Steps

1. cd fastertransformer_backend
2. export WORKSPACE=$(pwd)
3. docker run -it \
    --gpus all \
    -v ${WORKSPACE}:/workspace \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    --shm-size=16G \
    --name ft \
    triton_with_ft:22.07 bash
4. CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
    --model-repository=/workspace/all_models/gptj
5. python3 /workspace/tools/identity_test.py -o 500

Does the backend support multiple instances of the same model on one GPU device?

I have a 3090 GPU and deployed the Triton service using FasterTransformer as the backend. The model is GPT-2, and there is still a lot of GPU memory left, so I want to deploy more instances of the same model to improve throughput. But when I set the instance group count to 2, throughput decreases instead of increasing, and the request exception rate also increases.

My goal is to obtain greater throughput. What can I do in this case? Build more Triton services?

Request to support GCS file path

Hi,
I'm trying to deploy a FasterTransformer-based LLM using Triton on a GCP instance. I was wondering if there's a way to provide the file path to a Google Cloud Storage bucket when passing the model checkpoint in the config.pbtxt file?

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "gs://triton_sample_models/model-ul2-ft/ul2/1/2-gpu"
  }
}

FT backend crashes Triton server if batch size is too large

Description

Branch: main
Docker version: 22.03
GPU type: 2x NVIDIA RTX A6000

Reproduced Steps

  1. Load a model with the fastertransformer backend.
  2. Make a query with a batch size that is too large for GPU memory.

The server crashes with:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:26 

[gv013:3677168] *** Process received signal ***
[gv013:3677168] Signal: Aborted (6)
[gv013:3677168] Signal code:  (-6)
[gv013:3677168] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x1472e53a8420]
[gv013:3677168] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x1472e3d9c00b]
[gv013:3677168] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x1472e3d7b859]
[gv013:3677168] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x1472e4155911]
[gv013:3677168] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x1472e416138c]
[gv013:3677168] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x1472e41613f7]
[gv013:3677168] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x1472e41616a9]
[gv013:3677168] [ 7] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer5checkI9cudaErrorEEvT_PKcS4_i+0x219)[0x147265e83ab9]
[gv013:3677168] [ 8] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer12deviceMallocI6__halfEEvPPT_ib+0x36)[0x147265ff6146]
[gv013:3677168] [ 9] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer10GptJWeightI6__halfE13mallocWeightsEv+0x60)[0x147265eccc40]
[gv013:3677168] [10] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN17fastertransformer10GptJWeightI6__halfEC2Eiiiiiiiii+0x148)[0x147265ed05e8]
[gv013:3677168] [11] /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so(_ZN15GptJTritonModelI6__halfE19createModelInstanceEiiP11CUstream_stSt4pairISt6vectorIP8ncclCommSaIS7_EES9_ESt10shared_ptrIN17fastertransformer18AbstractCustomCommEE+0x3f7)[0x147265ec23e7]
[gv013:3677168] [12] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x16eb3)[0x1472daa44eb3]
[gv013:3677168] [13] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x173c2)[0x1472daa453c2]
[gv013:3677168] [14] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2b23d)[0x1472daa5923d]
[gv013:3677168] [15] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x1472e418dde4]
[gv013:3677168] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x1472e539c609]
[gv013:3677168] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x1472e3e78133]
[gv013:3677168] *** End of error message ***

It would be better if the FT backend just detected the out of memory condition and returned an error code for the request, rather than raising an assertion that crashes the whole server.

How to support different models with different tensor_para_size?

I have 4 GPUs and 3 models, called small, medium and large. I want to deploy the small model on GPU 0, the medium model on GPU 1, and the large model on GPU 2 and GPU 3 with tensor_para_size=2, because the large model is too big to fit on a single GPU.

However, the instance_group can only be KIND_CPU, so I have no way to control this placement.

Is there any way to handle this?

Support BLOOM model?

Hi triton developer,

Currently, the fastertransformer backend only supports GPT, GPT-J, T5, GPT-NeoX and BERT.

I am wondering whether there is any plan to support the bigscience/bloom model. If not, is it possible (and how) to convert BLOOM to one of the supported models and serve it with Triton using the fastertransformer backend?

Streaming throwing queue.get() error

Description

Dockerfile: faster_transformer(v1.2)
Model: GPT-J

Reproduced Steps

The streaming example in issue_requests.py throws the following error when passing in a request:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/rtalari/coreweave/kubernetes-cloud/online-inference/fastertransformer/client/example_http.py", line 94, in stream_consumer
    result = queue.get()
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
TypeError: __init__() missing 1 required positional argument: 'msg'

A simple sample request works over HTTP; for the prompt "She is a good girl" the output is:

Output: She is a good girl. She's a good girl. She's a good girl...'

'But you were in bed when we came in,' she says, and I

Using GEMM files in fastertransformer_backend.

Discussed in #48

Originally posted by SnoozingSimian September 22, 2022
While loading both GPTJ and GPT-NeoX models, I get the message
[WARNING] gemm_config.in is not found; using default GEMM algo

This suggests to me that there is a way to use GEMM algorithm configs while loading these models. I have generated the gemm_config.in for GPT-NeoX using the FasterTransformer binaries, but I don't know where to place this file so that it can be found by the backend.

Is there any possible way to use it currently?

Results output same value with zero probability in GPTJ-6B

Here is a reproduction of the scenario.

Dockerfile:

# Copyright 2022 Rahul Talari ([email protected])

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG TRITON_VERSION=22.04
ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-py3
FROM ${BASE_IMAGE} as server-builder

RUN export this_distro="$(cat /etc/os-release | grep '^ID=' | awk -F'=' '{print $2}')" \
    && export this_version="$(cat /etc/os-release | grep '^VERSION_ID=' | awk -F'=' '{print $2}' | sed 's/[^0-9]*//g')" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/7fa2af80.pub" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/3bf863cc.pub"
    
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    zip unzip wget build-essential autoconf autogen gdb \
    openssh-server zsh tmux mosh locales-all clangd sudo \
    python3.8 python3-pip python3-dev rapidjson-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/build/

# CMake
RUN CMAKE_VERSION=3.18 && \
    CMAKE_BUILD=3.18.4 && \
    wget -nv https://cmake.org/files/v${CMAKE_VERSION}/cmake-${CMAKE_BUILD}.tar.gz && \
    tar -xf cmake-${CMAKE_BUILD}.tar.gz && \
    cd cmake-${CMAKE_BUILD} && \
    ./bootstrap --parallel=$(grep -c ^processor /proc/cpuinfo) -- -DCMAKE_USE_OPENSSL=OFF && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
    cd /workspace/build/ && \
    rm -rf /workspace/build/cmake-${CMAKE_BUILD}

# backend build
WORKDIR /workspace/build/triton-experiments
ADD cmake cmake
ADD src src
ADD CMakeLists.txt CMakeLists.txt
ADD model_configs model_configs

ARG FORCE_BACKEND_REBUILD=0
RUN mkdir build -p && \
    cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

FROM ${BASE_IMAGE} as server

# NVIDIA Keys
RUN export this_distro="$(cat /etc/os-release | grep '^ID=' | awk -F'=' '{print $2}')" \
    && export this_version="$(cat /etc/os-release | grep '^VERSION_ID=' | awk -F'=' '{print $2}' | sed 's/[^0-9]*//g')" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/7fa2af80.pub" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/3bf863cc.pub"

# Needed for runner image
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    zip unzip wget git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# TODO: Change to PARALLEL and see performance metrics
ENV NCCL_LAUNCH_MODE=GROUP

COPY --from=server-builder /opt/tritonserver/backends/fastertransformer /opt/tritonserver/backends/fastertransformer

# Can change based on system configuration
ARG NUM_GPUS=1

# set workspace, src and triton model store locations
ENV WORKSPACE /workspace

# Need to create directories where the models and checkpoints are stores
RUN mkdir -p ${WORKSPACE}/models
ENV SRC_MODELS_DIR=${WORKSPACE}/models

# This is where Triton looks to serve the models from fastertransformer backend
RUN mkdir -p ${WORKSPACE}/triton-model-store
ENV TRITON_MODELS_STORE=${WORKSPACE}/triton-model-store

ENV MODEL_PATH=${TRITON_MODELS_STORE}/fastertransformer

WORKDIR /workspace

# Install pytorch
RUN pip3 install torch==1.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip3 install --extra-index-url https://pypi.ngc.nvidia.com regex fire tritonclient[all]
RUN pip3 install --upgrade jax jaxlib

RUN git clone https://github.com/NVIDIA/FasterTransformer.git 
RUN wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models 
RUN wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
RUN wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.gz

RUN tar -xzvf step_383500_slim.tar.gz -C ${SRC_MODELS_DIR}
RUN mkdir ${MODEL_PATH}/1 -p 

RUN python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptj/utils/gptj_ckpt_convert.py --output-dir ${MODEL_PATH}/1 --ckpt-dir ${SRC_MODELS_DIR}/step_383500/ --n-inference-gpus ${NUM_GPUS}

# Copy configs from server so that Triton understands how to reference the model
COPY --from=server-builder /workspace/build/triton-experiments/model_configs/gptj-6B/config.pbtxt ${MODEL_PATH}/

# ensure the structure matches Triton Model Specification 
RUN mv ${MODEL_PATH}/1/${NUM_GPUS}-gpu/* ${MODEL_PATH}/1/
RUN rm -rf ${MODEL_PATH}/1/${NUM_GPUS}-gpu

RUN mkdir /etc/ssh -p && \
    touch /etc/ssh/sshd_config && \
    sed -i 's/#X11UseLocalhost yes/X11UseLocalhost no/g' /etc/ssh/sshd_config
RUN mkdir /var/run/sshd -p

ADD . /workspace/triton-experiments

# Run the server with the model repository
ENTRYPOINT ["/opt/tritonserver/bin/tritonserver",  "--model-repository=/workspace/triton-model-store"]

When I call the endpoint over HTTP with sample_request.json via issue_request.py, I expect to see the log probabilities populated and the generated tokens to differ rather than repeat. This is the current output:

sequence_length:
[[9]
 [9]
 [9]
 [9]
 [9]
 [9]
 [9]
 [9]]

output_ids:
[[[ 9915 27221    59    77   383  1853  3327  1462 11799 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[ 6601  4237   345   460   779   284   787   257  2551 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[   59    77   611     7  9248   796   657     8   788 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[   38 10128  6032   651  8699     4  4048 20753    11 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[21448  7006   930 12901   930  7406  7006   198   198 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[13256    11   281  1605  3370    11  1444  6771   564 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[ 9915 27221    59    77   383  1853  3327  1462 11799 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]

 [[ 6601  4237   345   460   779   284   787   257  2551 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256 50256
   50256 50256 50256 50256 50256 50256 50256 50256 50256 50256]]]

output_log_probs:
[[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]

 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
   0.]]]

cum_log_probs:
[[-2.6934752e-01]
 [-2.7460208e+00]
 [-1.5650676e+00]
 [-1.3822436e+00]
 [-3.2310936e-04]
 [-1.2593781e+00]
 [-2.6934752e-01]
 [-2.7460208e+00]]

I think that the standard build provided with fastertransformer backend and GPTJ-6B is not working as intended. Can you please help with this issue and look into it?

How can I get the logits of all tokens in vocab at each step?

Hey, thanks for providing such a great tool! I noticed that gpt_guide.md mentions a parameter, output_log_probs, which records the log probability of the logits at each step for sampling; its shape is [batch_size, beam_width, request_output_seq_len].

Would it be possible to provide a parameter that records the logits of all output positions, so that I can inspect the top-k outputs at any position?

This parameter's shape might be [batch_size, beam_width, request_output_seq_len, vocab_size].
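If such an output existed (it does not today; the tensor below is purely hypothetical), picking the top-k candidates per position on the client side would be straightforward, for example:

# Hypothetical post-processing: given per-step logits with shape
# [batch_size, beam_width, request_output_seq_len, vocab_size]
# (no such output exists in the backend today), take the top-k ids/scores per position.
import numpy as np

def topk_per_position(logits, k=5):
    # argpartition returns the k largest indices (unsorted); sort those k afterwards.
    top_idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    top_val = np.take_along_axis(logits, top_idx, axis=-1)
    order = np.argsort(-top_val, axis=-1)
    return np.take_along_axis(top_idx, order, axis=-1), np.take_along_axis(top_val, order, axis=-1)

# Random data standing in for the hypothetical full-logits tensor (GPT-J vocab = 50400).
fake_logits = np.random.randn(2, 1, 8, 50400).astype(np.float32)
ids, scores = topk_per_position(fake_logits, k=5)
print(ids.shape, scores.shape)   # (2, 1, 8, 5) (2, 1, 8, 5)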

FasterTransformer BERT returns wrong values in my environment.

Description

fastertransformer_backend: 225b57898b830
FasterTransformer: aa3aaf6d2ed0359ceb
Docker version: 20.10.17, build 100c701
NVIDIA Driver Version: 515.65.01
GPU: Tesla T4 x 1

FasterTransformer seems to return wrong values for previously unseen input data in my environment (not always, but often).
Is there something wrong in my build and setup procedure?
I tried NVIDIA drivers 460.32.03 and 515.65.01, but no luck.

Reproduced Steps

  1. clone fastertransformer_backend repo.
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
  2. Build container image.
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
  3. Run container.
docker run -it --rm --gpus=all -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
  4. Convert Huggingface's BERT (in container)
export WORKSPACE=$(pwd)

sudo apt-get install git-lfs
git lfs install
git lfs clone https://huggingface.co/bert-base-uncased # Download model from huggingface
git clone https://github.com/NVIDIA/FasterTransformer.git # To convert checkpoint
export PYTHONPATH=${WORKSPACE}/FasterTransformer:${PYTHONPATH}
python3 FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
        -in_file bert-base-uncased/ \
        -saved_dir ${WORKSPACE}/all_models/bert/fastertransformer/1/ \
        -infer_tensor_para_size 1
  5. Modify config.pbtxt (due to -infer_tensor_para_size 1)
sed -i -e 's/string_value: "2"/string_value: "1"/' \
  -e "s#../all_models/bert/fastertransformer/1/2-gpu/#$WORKSPACE/all_models/bert/fastertransformer/1/1-gpu/#" all_models/bert/fastertransformer/config.pbtxt
rm -rf all_models/bert/fastertransformer/1/2-gpu/
  6. Start Triton.
CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/ &
  7. Invoke test script (UserWarnings are removed for brevity).
root@77b40022d054:/home/uno/fastertransformer_backend# ./test.py --text "I have a pen."
Loading HF tokenizer and model...
Tokenizer text=I have a pen.
Prepare input hidden state
Invoke Huggingface
Invoke FasterTransformer
set request
I1024 06:52:51.197681 120 libfastertransformer.cc:1090] Start to forward
I1024 06:52:51.200114 120 libfastertransformer.cc:1098] Stop to forward
get results as output_hidden_state

ft_output_hidden_state: 
[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
abs max diff: 8.78125
abs mean diff: 0.08430705964565277

root@77b40022d054:/home/uno/fastertransformer_backend# ./test.py --text "I have a pen."
Loading HF tokenizer and model...
Tokenizer text=I have a pen.
Prepare input hidden state
Invoke Huggingface
Invoke FasterTransformer
set request
I1024 06:53:26.158261 120 libfastertransformer.cc:1090] Start to forward
I1024 06:53:26.160828 120 libfastertransformer.cc:1098] Stop to forward
get results as output_hidden_state

ft_output_hidden_state: 
[[-0.04846  0.01334]
 [ 0.227    0.4778 ]
 [ 0.3364   0.642  ]
 [ 0.1345   0.3196 ]
 [ 1.201    0.5034 ]]
abs max diff: 0.0390625
abs mean diff: 0.00032764553907327354

root@77b40022d054:/home/uno/fastertransformer_backend# ./test.py --text "I have an apple."
Loading HF tokenizer and model...
Tokenizer text=I have an apple.
Prepare input hidden state
Invoke Huggingface
Invoke FasterTransformer
set request
I1024 06:53:51.891521 120 libfastertransformer.cc:1090] Start to forward
I1024 06:53:51.894008 120 libfastertransformer.cc:1098] Stop to forward
get results as output_hidden_state

ft_output_hidden_state: 
[[-0.04846  0.01334]
 [ 0.227    0.4778 ]
 [ 0.3364   0.642  ]
 [ 0.1345   0.3196 ]
 [ 1.201    0.5034 ]]
abs max diff: 2.24365234375
abs mean diff: 0.04817396402359009

root@77b40022d054:/home/uno/fastertransformer_backend# ./test.py --text "I have an apple."
Loading HF tokenizer and model...
Tokenizer text=I have an apple.
Prepare input hidden state
Invoke Huggingface
Invoke FasterTransformer
set request
I1024 06:54:18.815628 120 libfastertransformer.cc:1090] Start to forward
I1024 06:54:18.818453 120 libfastertransformer.cc:1098] Stop to forward
get results as output_hidden_state

ft_output_hidden_state: 
[[-0.00159  0.181  ]
 [ 0.1421   0.4343 ]
 [ 0.2769   0.703  ]
 [-0.03543  1.12   ]
 [ 0.313    1.049  ]]
abs max diff: 0.015625
abs mean diff: 0.0003199076163582504
  8. test.py
#!/usr/bin/python3
import argparse
import numpy as np
import sys
from builtins import range
import statistics as s
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
import random
import torch
from transformers import BertModel, AutoTokenizer


def prepare_tensor(name, input):
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def create_inference_server_client(protocol, url, concurrency, verbose):
    return grpcclient.InferenceServerClient(url, verbose=verbose)

def sequence_mask(lengths, max_len=None, is_2d=True):
    batch_size = lengths.numel()
    max_len = max_len or lengths.max()
    mask = (torch.arange(0, max_len, device=lengths.device)
            .type_as(lengths).repeat(batch_size, 1).lt(lengths.unsqueeze(1)))
    if is_2d:
        return mask
    else:
        mask = mask.view(-1, 1, 1, max_len)
        m2 = mask.transpose(2, 3)
        return mask * m2

def send_requests(url, input_hidden_state, sequence_lengths,
                  flags, request_parallelism=10):
    model_name = "fastertransformer"
    with create_inference_server_client(flags.protocol,
                                        url,
                                        concurrency=request_parallelism,
                                        verbose=False) as client:
        requests = []
        results = []
        
        for i in range(request_parallelism):
            inputs = [
                prepare_tensor("input_hidden_state", input_hidden_state),
                prepare_tensor("sequence_lengths", sequence_lengths),
            ]

            print("set request")
            result = client.infer(model_name, inputs)
            results.append(result)

        for i in range(request_parallelism):

            output_hidden_state = results[i].as_numpy("output_hidden_state")
            print("get results as output_hidden_state\n")
            if output_hidden_state is None:
                print("error: expected 'output_hidden_state'")
                sys.exit(1)
            else:
                pass
    return output_hidden_state

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--text", type=str, default="I have an apple.", help="text data")
    FLAGS = parser.parse_args()
    FLAGS.protocol = "grpc"
    FLAGS.url = "localhost:8001"
    FLAGS.seq_len=32
    FLAGS.hf_ckpt_path="bert-base-uncased"
    FLAGS.inference_data_type="fp16"

    torch.manual_seed(0)
    random.seed(0)
    np.random.seed(0)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    print("Loading HF tokenizer and model...")
    tokenizer = AutoTokenizer.from_pretrained(FLAGS.hf_ckpt_path)
    hf_model = BertModel.from_pretrained(FLAGS.hf_ckpt_path)
    hf_model.cuda().eval().to(torch.float16)

    print("Tokenizer text={}".format(FLAGS.text))
    encoded = tokenizer.batch_encode_plus([FLAGS.text], return_tensors="pt", padding="max_length", max_length=FLAGS.seq_len)
    encoded = {k:v.to(device) for k, v in encoded.items()}

    print("Prepare input hidden state")
    input_hidden_state = hf_model.embeddings(input_ids=encoded["input_ids"], token_type_ids=encoded["token_type_ids"])
    input_hidden_state = input_hidden_state.cpu().detach().numpy().astype(np.float16)
    sequence_lengths = torch.sum(encoded['attention_mask'], axis=1).cpu().detach().numpy()[np.newaxis,:].astype(np.int32)

    mask = sequence_mask(torch.Tensor(sequence_lengths.reshape([-1])).to(torch.int), FLAGS.seq_len, False)
    mask = mask.to(torch.float16).cuda()
    output_mask = sequence_mask(torch.Tensor(sequence_lengths.reshape([-1])).to(torch.int), FLAGS.seq_len)
    output_mask = output_mask.to(mask.dtype).unsqueeze(-1).cuda()
    
    print("Invoke Huggingface")
    hf_input = torch.Tensor(input_hidden_state).cuda().to(torch.float16)
    extended_attention_mask = (1.0 - mask) * -10000.0
    hf_output = hf_model.encoder(hf_input, extended_attention_mask, return_dict=False)[0] * output_mask

    print("Invoke FasterTransformer")
    ft_output_hidden_state = send_requests(FLAGS.url, input_hidden_state, sequence_lengths,
                                           FLAGS, request_parallelism=1)

    print("ft_output_hidden_state: \n{}".format(ft_output_hidden_state[0, :5, :2]))
    diff = (torch.Tensor(ft_output_hidden_state).cuda() * output_mask) - hf_output

    print(f"abs max diff: {diff.abs().max()}")
    print(f"abs mean diff: {diff.abs().mean()}")

Can't run multi-node GPTJ inference

I followed the tutorial provided here. I am able to run GPT-J 6B on a single node. However, when I try the multi-node inference example with the following command on two nodes:

WORKSPACE="/workspace" # the dir you build the docker
CONTAINER_VERSION=22.07
IMAGE=bdhu/triton_with_ft:22.07
CMD="/opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/all_models/gptj"
srun -N 2 -n 2 --mpi=pmi2 -o inference_server.log \
               --container-mounts /home/edwardhu/my-script/triton:${WORKSPACE} \
               --container-name multi-node-ft-triton \
               --container-image ${IMAGE} \
               bash -c "$CMD"

It shows the following error in the log file:

...


E1011 02:30:34.325725 21361 libfastertransformer.cc:168] Invalid configuration argument 'is_half': stoi
E1011 02:30:34.325737 21361 libfastertransformer.cc:168] Invalid configuration argument 'max_seq_len': stoi
E1011 02:30:34.325742 21361 libfastertransformer.cc:168] Invalid configuration argument 'head_num': stoi
E1011 02:30:34.325747 21361 libfastertransformer.cc:168] Invalid configuration argument 'size_per_head': stoi
E1011 02:30:34.325752 21361 libfastertransformer.cc:168] Invalid configuration argument 'inter_size': stoi
E1011 02:30:34.325757 21361 libfastertransformer.cc:168] Invalid configuration argument 'decoder_layers': stoi
E1011 02:30:34.325761 21361 libfastertransformer.cc:168] Invalid configuration argument 'vocab_size': stoi
E1011 02:30:34.325766 21361 libfastertransformer.cc:168] Invalid configuration argument 'rotary_embedding': stoi
E1011 02:30:34.325770 21361 libfastertransformer.cc:168] Invalid configuration argument 'start_id': stoi
E1011 02:30:34.325775 21361 libfastertransformer.cc:168] Invalid configuration argument 'end_id': stoi
W1011 02:30:34.325785 21361 libfastertransformer.cc:334] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
I1011 02:30:34.327796 24630 libfastertransformer.cc:420] Before Loading Weights:
I1011 02:30:34.328497 21361 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W1011 02:30:34.328515 21361 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W1011 02:30:34.328518 21361 libfastertransformer.cc:459] Model name gpt-j-6b
W1011 02:30:34.328540 21361 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328543 21361 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328546 21361 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328548 21361 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328551 21361 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328554 21361 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328556 21361 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328559 21361 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328561 21361 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328572 21361 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328574 21361 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W1011 02:30:34.328577 21361 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_UINT64, shape: [1]
W1011 02:30:34.328579 21361 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W1011 02:30:34.328581 21361 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328584 21361 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328587 21361 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W1011 02:30:34.328590 21361 libfastertransformer.cc:578] Get input name: prompt_learning_task_name_ids, type: TYPE_UINT32, shape: [1]
W1011 02:30:34.328594 21361 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W1011 02:30:34.328597 21361 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W1011 02:30:34.328599 21361 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W1011 02:30:34.328602 21361 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
after allocation    : free: 15.36 GB, total: 15.78 GB, used:  0.42 GB
[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.bias.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.input_layernorm.weight.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.query_key_value.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.attention.dense.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_h_to_4h.bias.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.14.mlp.dense_4h_to_h.bias.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.bias.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.input_layernorm.weight.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.query_key_value.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.attention.dense.weight.0.bin 

[FT][WARNING] shape is zero, skip loading weight from file /workspace/all_models/gptj/fastertransformer/1/4-gpu//model.layers.15.mlp.dense_h_to_4h.weight.0.bin 


.... many similar warnings



after allocation    : free: 13.27 GB, total: 15.78 GB, used:  2.51 GB
W1011 02:30:46.476398 24630 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
[node5:21361] *** An error occurred in MPI_Bcast
[node5:21361] *** reported by process [4915207,1]
[node5:21361] *** on communicator MPI_COMM_WORLD
[node5:21361] *** MPI_ERR_TRUNCATE: message truncated
[node5:21361] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node5:21361] ***    and potentially your MPI job)
slurmstepd: error: *** STEP 75.7 ON node4 CANCELLED AT 2022-10-10T21:30:46 ***

Is there any hint on how to resolve this issue? Thanks!
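One thing I would double-check (this is an assumption on my side, not a confirmed root cause): MPI_ERR_TRUNCATE on MPI_Bcast can show up when the rank/GPU layout does not match the parallelism settings. A small sketch that compares the config.pbtxt values against the cluster layout (the parsing is deliberately crude, and the node/GPU counts are placeholders):

# Sanity-check sketch (an assumption, not a confirmed fix): the total number of GPUs
# across all ranks should equal tensor_para_size * pipeline_para_size from config.pbtxt,
# and tensor_para_size should match the "<N>-gpu" directory produced by the converter.
import re
import sys

def read_param(pbtxt_path, key):
    text = open(pbtxt_path).read()
    # crude parse: find key: "<key>" followed by its string_value
    m = re.search(r'key:\s*"%s".*?string_value:\s*"([^"]+)"' % key, text, re.S)
    return int(m.group(1)) if m else None

pbtxt = sys.argv[1] if len(sys.argv) > 1 else "all_models/gptj/fastertransformer/config.pbtxt"
tp = read_param(pbtxt, "tensor_para_size")
pp = read_param(pbtxt, "pipeline_para_size")
nodes, gpus_per_node = 2, 2   # placeholders for the srun -N value and GPUs per node
print(f"tensor_para_size={tp}, pipeline_para_size={pp}, total GPUs={nodes * gpus_per_node}")
if tp and pp and tp * pp != nodes * gpus_per_node:
    print("Mismatch: the backend expects tensor_para_size * pipeline_para_size GPUs in total.")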

GPT-NeoX throws Segmentation Fault (Signal 6)

@byshiue Getting this error when launching Triton with GPT-NeoX.

I0909 00:01:05.466214 1 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer (device 0)
W0909 00:01:05.534072 1 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W0909 00:01:05.534113 1 libfastertransformer.cc:459] Model name gptneox_20b
W0909 00:01:05.534128 1 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W0909 00:01:05.534133 1 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534136 1 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534140 1 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534144 1 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W0909 00:01:05.534148 1 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534151 1 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W0909 00:01:05.534156 1 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W0909 00:01:05.534160 1 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W0909 00:01:05.534166 1 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W0909 00:01:05.534170 1 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W0909 00:01:05.534175 1 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_UINT64, shape: [1]
W0909 00:01:05.534179 1 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W0909 00:01:05.534184 1 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534189 1 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W0909 00:01:05.534193 1 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W0909 00:01:05.534197 1 libfastertransformer.cc:578] Get input name: prompt_learning_task_name_ids, type: TYPE_UINT32, shape: [1]
W0909 00:01:05.534204 1 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W0909 00:01:05.534210 1 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W0909 00:01:05.534214 1 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W0909 00:01:05.534219 1 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:1    :0:11] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:     11) ====
 0 0x00000000000143c0 __funlockfile()  ???:0
 1 0x000000000001ef3a triton::backend::fastertransformer_backend::ModelInstanceState::ModelInstanceState()  :0
 2 0x000000000001fd42 triton::backend::fastertransformer_backend::ModelInstanceState::Create()  :0
 3 0x000000000002263c TRITONBACKEND_ModelInstanceInitialize()  ???:0
 4 0x000000000010ce8a triton::core::TritonModelInstance::CreateInstance()  :0
 5 0x000000000010e971 triton::core::TritonModelInstance::CreateInstances()  :0
 6 0x0000000000101a10 triton::core::TritonModel::Create()  :0
 7 0x00000000001b217a triton::core::ModelRepositoryManager::ModelLifeCycle::CreateModel()  :0
 8 0x00000000001c0fa1 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::core::Status (triton::core::ModelRepositoryManager::ModelLifeCycle::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, triton::core::ModelRepositoryManager::ModelLifeCycle::ModelInfo*), triton::core::ModelRepositoryManager::ModelLifeCycle*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long, triton::core::ModelRepositoryManager::ModelLifeCycle::ModelInfo*> > >::_M_run()  :0
 9 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
10 0x0000000000008609 start_thread()  ???:0
11 0x000000000011f163 clone()  ???:0
=================================
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] *** Process received signal ***
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] Signal: Segmentation fault (11)
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] Signal code:  (-6)
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] Failing at address: 0x1
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f3d34fa43c0]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 1] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x1ef3a)[0x7f3d22a70f3a]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 2] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x1fd42)[0x7f3d22a71d42]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 3] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(TRITONBACKEND_ModelInstanceInitialize+0x38c)[0x7f3d22a7463c]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 4] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x10ce8a)[0x7f3d34245e8a]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 5] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x10e971)[0x7f3d34247971]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 6] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x101a10)[0x7f3d3423aa10]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 7] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b217a)[0x7f3d342eb17a]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 8] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1c0fa1)[0x7f3d342f9fa1]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f3d33d88de4]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [10] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f3d34f98609]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3d33a73163]
[fastertransformer-triton-predictor-default-00001-deploymenknpmt:00001] *** End of error message ***

I downloaded and converted the weights for GPT-NeoX according to the guide and set the checkpoint path appropriately. Here are my config.ini and config.pbtxt:

config.ini

[gptneox]
model_name=gptneox_20B
head_num=64
size_per_head=96
vocab_size=50432
num_layer=44
rotary_embedding=24
start_id=0
end_id=2
inter_size=24576
use_gptj_residual=1
weight_data_type=fp32

config.pbtxt

# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gptneox_20b"
max_batch_size: 1024

model_transaction_policy {
decoupled: False
}

input [
{
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
},
{
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
},
{
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
},
{
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
},
{
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
},
{
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
},
{
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
}
]
output [
{
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
},
{
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
},
{
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
},
{
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
}
]
parameters {
key: "tensor_para_size"
value: {
    string_value: "2"
}
}
parameters {
key: "pipeline_para_size"
value: {
    string_value: "1"
}
}
parameters {
key: "data_type"
value: {
    string_value: "fp32"
}
}
parameters {
key: "model_type"
value: {
    string_value: "GPT-NeoX"
}
}
parameters {
key: "model_checkpoint_path"
value: {
    string_value: "/mnt/pvc/triton-model-store/fastertransformer/1/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
    string_value: "0"
}
}

Have you encountered this issue?
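For what it's worth, this is the small consistency check I would run before digging further into the backtrace (the weight file-name pattern is an assumption based on the converter output; adjust it to whatever your checkpoint directory actually contains):

# Consistency-check sketch; the weight file-name pattern is an assumption based on the
# converter output (adjust to what your checkpoint directory actually contains).
import configparser
import glob
import os

checkpoint_path = "/mnt/pvc/triton-model-store/fastertransformer/1/"   # from config.pbtxt
tensor_para_size = 2                                                    # from config.pbtxt

ini = configparser.ConfigParser()
ini.read(os.path.join(checkpoint_path, "config.ini"))
if ini.has_section("gptneox"):
    print(dict(ini["gptneox"]))
else:
    print("config.ini not found or missing [gptneox] section")

shards = glob.glob(os.path.join(
    checkpoint_path, "model.layers.0.attention.query_key_value.weight.*.bin"))
print(f"found {len(shards)} layer-0 QKV shards, expected {tensor_para_size}")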

Memory usage not going up with model instances

Hi,

I am using this backend for inference with a GPT-J model (a Codegen checkpoint converted to the GPT-J format, to be precise), and I'm trying to load more than one model instance to process concurrent requests. However, as the number of instances increases, GPU memory usage doesn't go up accordingly. The first instance takes about 6 GB of memory, but each subsequent instance only adds a tiny fraction of that. Was wondering if this is a bug?

Here are the relevant details of the config.pbtxt file:

instance_group [
  {
    count: 3
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}

Any help would be appreciated!
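To quantify the observation, this is the kind of probe I run between restarts with different count values (it assumes the nvidia-ml-py package that provides pynvml; it just reads the current memory usage of GPU 0):

# Probe sketch: read current memory usage of GPU 0 between restarts with different
# instance_group counts (requires the nvidia-ml-py package that provides pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used: {info.used / 1024**3:.2f} GiB / total: {info.total / 1024**3:.2f} GiB")
pynvml.nvmlShutdown()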

Server crashes when traffic is a little bit high

Description

main branch, V100

The deployed Docker pods crash and restart every few minutes. Things seem stable when QPS is low.
Below is the error log from before a pod crashes, obtained with: kubectl logs <pod-name> --all-containers -p
 0# 0x000055F47681BC19 in tritonserver
 1# 0x00007F17168FE090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00007F1716CB7911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 4# 0x00007F1716CC338C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F1716CC33F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F1716CC36A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 8# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# 0x00007F170C34EE0A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
10# 0x00007F1716CEFDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
11# 0x00007F1717F04609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
12# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

I also sometimes observe the following error response on the Triton client side:
[StatusCode.INTERNAL] pinned buffer: failed to perform CUDA copy: invalid argument

Maybe the client error is related to the server pod crash.
It would be great if this could be fixed, thanks!

Reproduced Steps

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
docker tag ${TRITON_DOCKER_IMAGE} <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}
docker push <github_or_gitlab/repo_name/image_name>:${CONTAINER_VERSION}

Upload the FT model folders, deploy the Triton service to Docker, then test using the Triton client API.

Crash GPT-J on mGPU

Description

Additional information: Running on single GPU is fine.

NVIDIA-SMI 510.85.02
Driver Version: 510.85.02
CUDA Version: 11.7 
Tesla V100-SXM2-32GB
triton_with_ft:22.07


/workspace/tools/identity_test.py:102: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.ones([input_start_ids.shape[0], 1]).astype(np.bool)
set request
W0902 08:44:50.026851 940 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0, executing 1 requests
W0902 08:44:50.026892 940 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0 with 1 requests
W0902 08:44:50.026904 940 libfastertransformer.cc:886] get total batch_size = 8
W0902 08:44:50.026919 940 libfastertransformer.cc:1296] get input count = 16
W0902 08:44:50.026935 940 libfastertransformer.cc:1368] collect name: start_id size: 32 bytes
W0902 08:44:50.026946 940 libfastertransformer.cc:1368] collect name: input_ids size: 704 bytes
W0902 08:44:50.026954 940 libfastertransformer.cc:1368] collect name: bad_words_list size: 64 bytes
W0902 08:44:50.026965 940 libfastertransformer.cc:1368] collect name: random_seed size: 64 bytes
W0902 08:44:50.026973 940 libfastertransformer.cc:1368] collect name: end_id size: 32 bytes
W0902 08:44:50.026981 940 libfastertransformer.cc:1368] collect name: input_lengths size: 32 bytes
W0902 08:44:50.026990 940 libfastertransformer.cc:1368] collect name: request_output_len size: 32 bytes
W0902 08:44:50.026998 940 libfastertransformer.cc:1368] collect name: runtime_top_k size: 32 bytes
W0902 08:44:50.027005 940 libfastertransformer.cc:1368] collect name: runtime_top_p size: 32 bytes
W0902 08:44:50.027013 940 libfastertransformer.cc:1368] collect name: temperature size: 32 bytes
W0902 08:44:50.027020 940 libfastertransformer.cc:1368] collect name: len_penalty size: 32 bytes
W0902 08:44:50.027031 940 libfastertransformer.cc:1368] collect name: is_return_log_probs size: 8 bytes
W0902 08:44:50.027042 940 libfastertransformer.cc:1368] collect name: stop_words_list size: 64 bytes
W0902 08:44:50.027053 940 libfastertransformer.cc:1368] collect name: beam_width size: 32 bytes
W0902 08:44:50.027066 940 libfastertransformer.cc:1368] collect name: beam_search_diversity_rate size: 32 bytes
W0902 08:44:50.027078 940 libfastertransformer.cc:1368] collect name: repetition_penalty size: 32 bytes
W0902 08:44:50.027088 940 libfastertransformer.cc:1379] the data is in CPU
W0902 08:44:50.027094 940 libfastertransformer.cc:1386] the data is in CPU
W0902 08:44:50.027163 940 libfastertransformer.cc:1244] before ThreadForward 0
W0902 08:44:50.027288 940 libfastertransformer.cc:1252] after ThreadForward 0
W0902 08:44:50.027302 940 libfastertransformer.cc:1244] before ThreadForward 1
W0902 08:44:50.027357 940 libfastertransformer.cc:1252] after ThreadForward 1
I0902 08:44:50.027371 940 libfastertransformer.cc:1090] Start to forward
I0902 08:44:50.027407 940 libfastertransformer.cc:1090] Start to forward
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
Signal (  what():  [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:96

6) received.
Signal (6) received.
 0# 0x000055C3734BAC19 in /opt/tritonserver/bin/tritonserver
 1# 0x00007FA6EEF99090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007FA6EF36053A in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007FA6EF35E38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007FA6EF35E3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007FA6EF35E6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007FA63CFEC37D in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# fastertransformer::invokeLengthCriterion(bool*, bool*, int*, unsigned int const*, int, int, int, CUstream_st*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# fastertransformer::DynamicDecodeLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# fastertransformer::GptJ<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GptJWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
12# GptJTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
13# 0x00007FA6E4434C5A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
14# 0x00007FA6EF38ADE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
15# 0x00007FA6F059F609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 163, in get_socket
    return self._socket_queue.get(block=False)
  File "src/gevent/queue.py", line 335, in gevent._gevent_cqueue.Queue.get
  File "src/gevent/queue.py", line 350, in gevent._gevent_cqueue.Queue.get
  File "src/gevent/queue.py", line 319, in gevent._gevent_cqueue.Queue._Queue__get_or_peek

_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/tools/identity_test.py", line 273, in <module>
    send_requests(FLAGS.url, FLAGS.batch_size, input_start_ids,
  File "/workspace/tools/identity_test.py", line 136, in send_requests
    result = client.infer(model_name, inputs)
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1414, in infer
    response = self._post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 309, in _post
    response = self._client_stub.post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 226, in request
    sock = self._connection_pool.get_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 166, in get_socket
    return self._create_socket()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 127, in _create_socket
    raise first_error
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 114, in _create_socket
    sock = self._connect_socket(sock, sock_info[-1])
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 136, in _connect_socket
    sock.connect(address)
  File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 607, in connect
    raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused
root@3acd604a7360:/workspace# --------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 3acd604a7360 exited on signal 6 (Aborted).
--------------------------------------------------------------------------


Reproduced Steps

Steps to reproduce:
1. cd fastertransformer_backend
2. export WORKSPACE=$(pwd)
3. docker run -it \
    --gpus all \
    -v ${WORKSPACE}:/workspace \
    -p 8000:8000 \
    -p 8001:8001 \
    -p 8002:8002 \
    --shm-size=2G \
    --name ft \
    triton_with_ft:22.07 bash
4. CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
    --model-repository=/workspace/all_models/gptj
5. python3 /workspace/tools/identity_test.py

GPT-J Preprocessing Incorrectly Tokenizes `<|endoftext|>`

Description

Expected behavior:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> tokenizer.encode('<|endoftext|>')
[50256]

Reproduced Steps

Actual behavior:

$ cd all_models/gptj/preprocessing/1
$ python
>>> from word_list import to_word_list_format
>>> import numpy as np
>>> to_word_list_format(np.array([['<|endoftext|>']]))
array([[[  27,   91,  437, 1659, 5239,   91,   29],
        [   7,   -1,   -1,   -1,   -1,   -1,   -1]]], dtype=int32)

BPE merges seem to be working correctly. However, during pre-tokenization, <|endoftext|> is broken up into <|, endoftext, and |>, with merges applied to each part separately. This seems incorrect if we take the Hugging Face implementation as the reference.

I came across this bug trying to ban <|endoftext|> using the bad_words parameter.
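As a workaround while the pre-tokenization is broken, the bad_words_list tensor can be built directly from token ids instead of strings. The sketch below mirrors the flat ids/offsets layout that to_word_list_format produces (treating a hand-built tensor identically to the preprocessing output is an assumption on my side):

# Workaround sketch: build bad_words_list directly from token ids, bypassing the string
# pre-tokenization in word_list.py. Layout mirrors the array shown above:
# row 0 = flattened token ids, row 1 = cumulative end offsets, padded with -1.
import numpy as np

def word_list_from_ids(batch_of_banned_ids):
    max_flat = max(sum(len(w) for w in words) for words in batch_of_banned_ids)
    out = np.full((len(batch_of_banned_ids), 2, max_flat), -1, dtype=np.int32)
    for b, words in enumerate(batch_of_banned_ids):
        flat, offsets, total = [], [], 0
        for w in words:
            flat.extend(w)
            total += len(w)
            offsets.append(total)
        out[b, 0, :len(flat)] = flat
        out[b, 1, :len(offsets)] = offsets
    return out

# Ban <|endoftext|> (token id 50256) for a batch of one request.
print(word_list_from_ids([[[50256]]]))
# -> [[[50256]
#      [    1]]]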

Error if Triton Binary is started early

I've run into a situation where I get the following error.

...
W0401 23:09:32.708642 1 libfastertransformer.cc:648] input: OUTPUT1, type: TYPE_FP32, shape: [63, 63]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:314

[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** Process received signal ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal: Aborted (6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal code:  (-6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff1854a83c0]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff184c3018b]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7ff184c0f859]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff184fe6911]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff184ff238c]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff184ff23f7]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff184ff26a9]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 7] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99)[0x7ff120d9dd99]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 8] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5)[0x7ff120d879c5]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 9] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66)[0x7ff120d87e66]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [10] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a)[0x7ff120d9d40a]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [11] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff18501ede4]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff18549c609]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff184d0c293]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** End of error message ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:1    :0:32] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
Before Loading Model:
==== backtrace (tid:     32) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7ff022b43824]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7ff022b439ff]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7ff022b43d34]
 3  /lib/x86_64-linux-gnu/libc.so.6(abort+0x213) [0x7ff184c0f941]
 4  /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7ff184fe6911]
 5  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7ff184ff238c]
 6  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7ff184ff23f7]
 7  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7ff184ff26a9]
 8  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99) [0x7ff120d9dd99]
 9  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5) [0x7ff120d879c5]
10  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66) [0x7ff120d87e66]
11  /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a) [0x7ff120d9d40a]
12  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4) [0x7ff18501ede4]
13  /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7ff18549c609]
14  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff184d0c293]
=================================
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** Process received signal ***
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal: Segmentation fault (11)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Signal code:  (-6)
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] Failing at address: 0x1
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7ff1854a83c0]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 1] /lib/x86_64-linux-gnu/libc.so.6(abort+0x213)[0x7ff184c0f941]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 2] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7ff184fe6911]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7ff184ff238c]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7ff184ff23f7]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7ff184ff26a9]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 6] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x27d99)[0x7ff120d9dd99]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 7] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x119c5)[0x7ff120d879c5]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 8] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x11e66)[0x7ff120d87e66]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [ 9] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x2740a)[0x7ff120d9d40a]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [10] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff18501ede4]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [11] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7ff18549c609]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] [12] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff184d0c293]
[triton-small-gpt-a20cc9d0-7d647dd56-slbgq:00001] *** End of error message ***

To reproduce this, I first built the image as described in the README on the dev/v1.1_beta branch with Triton version 21.08.
Then I run the container like this:

docker run -it --rm -d --gpus=4 -p8000:8000 -p8001:8001 -p8002:8002 \
    -v ${TRITON_MODELS_STORE}:/model-store:ro \
    -v ${WORKSPACE}:/ft_workspace \
    --name ft \
    ${TRITON_DEV_DOCKER_IMAGE} \
    /bin/bash

and then go into the container with docker exec -it ft /bin/bash.

Finally I run the binary to start the server

mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model-store

If I start the server very soon after the container is started (likely somewhere < 5s), I will get this error. To solve this locally, I run the server binary again, and it works.

However, if I wait some more time after the container starts (10 s seems to work reliably) before running the binary, the error does not appear.

Searching the internet, this seems like some mismatch of NVIDIA drivers (e.g. CUDA versions), but I find this odd because the drivers should be contained in the Docker image anyway.

I can reproduce this on A100s and T4s as well.

GPTJ end_id usage and behavior

Hello. First of all, I want to thank you a lot for your work on this framework. Really appreciated!
I have a question about the generation process with GPT-J.
Everything is set up and generation works well, but the end_id input

{
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }

doesn't seem to work as expected during generation.

Lets say I have a prefix for generation:

Jack: Hello buddy
Mike: Hello there!
Jack:

and I want the generation process to stop when it reaches end_id = 198 (the token for \n), so that only one turn of this "dialogue" is generated.

My expected output is

Jack: Hello buddy\nMike: Hello there!\nJack: How are you?

but I get

Jack: Hello buddy\nMike: Hello there!\nJack: How are you?\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

and generation continues until max_steps is reached.

Could you please point me to the part of the code, or to a possible mistake on my side, that would explain this kind of behavior?
Thank you!
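For reference, this is roughly how I attach end_id (and, as an alternative I'm experimenting with, a stop_words_list for the newline token) to the request; names and shapes follow the config.pbtxt snippet above, and whether end_id alone is supposed to truncate the returned tokens is exactly what I'm asking:

# Sketch of the extra inputs; names/shapes follow the config.pbtxt above.
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

def prepare_tensor(name, array):
    t = httpclient.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    t.set_data_from_numpy(array)
    return t

batch_size = 1
end_id_np = np.full((batch_size, 1), 198, dtype=np.uint32)   # 198 = "\n" in the GPT-2 BPE vocab
# Alternative: stop generation on "\n" via stop_words_list
# (row 0 = flattened token ids, row 1 = cumulative end offsets, padded with -1).
stop_words_np = np.array([[[198], [1]]] * batch_size, dtype=np.int32)

extra_inputs = [
    prepare_tensor("end_id", end_id_np),
    prepare_tensor("stop_words_list", stop_words_np),
]
print(end_id_np.shape, stop_words_np.shape)   # (1, 1) (1, 2, 1)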

After using the Triton fastertransformer backend, inference speed is severely reduced

Description

With the Triton fastertransformer backend, the same model on the same data is much slower than with the Torch op.

model: mt5

Reproduced Steps

result:

torch op:
ft: 100%|█████████████████████████████████████| 976/976 [00:46<00:00, 20.85it/s]
engine: ft, batch: 1, cps:2513.5870329060313, elapsed: 46.808405s, bleu: 25.534268557480004
ft: 100%|█████████████████████████████████████| 488/488 [00:26<00:00, 18.49it/s]
engine: ft, batch: 2, cps:4457.269456353881, elapsed: 26.396654s, bleu: 25.542048594569284
ft: 100%|█████████████████████████████████████| 244/244 [00:14<00:00, 17.09it/s]
engine: ft, batch: 4, cps:8239.068868524504, elapsed: 14.280376s, bleu: 25.502996135441286
ft: 100%|█████████████████████████████████████| 122/122 [00:07<00:00, 15.83it/s]
engine: ft, batch: 8, cps:15264.420564754077, elapsed: 7.707924s, bleu: 25.52744772193587


triton backend:
100%|█████████████████████████████████████████| 976/976 [00:55<00:00, 17.52it/s]
batch: 1, cps:2112.1205511740964, elapsed: 55.705627s bleu: 25.503309091716304
100%|█████████████████████████████████████████| 488/488 [00:41<00:00, 11.89it/s]
batch: 2, cps:2867.621256393294, elapsed: 41.029477s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:24<00:00,  9.89it/s]
batch: 4, cps:4770.055153228809, elapsed: 24.665753s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:13<00:00,  9.22it/s]
batch: 8, cps:8888.158091941497, elapsed: 13.237501s bleu: 25.525217557230196

T5 not performing as expected

Description

I am trying to optimize T5-small inference using FasterTransformer. I am running on a single V100, I followed all the steps in `t5_guide.md` exactly, and I got a sensible BLEU score. And yet, when measuring end-to-end inference performance (including the time it takes to set the client's `InputTensor`s, etc.), the speedup is far from the 22x promoted in the related blog post. I was not able to run with `fp16`, as the model is not stable enough (this has been mentioned multiple times in the `transformers` repo).
Am I missing something? Is there a way to run with `fp16` that I am not aware of?

Thanks in advance for your reply,

N

Reproduced Steps

Follow the T5 guide/blogpost.

Config.pbtxt for all_models/t5/fastertransformer incorrect

Description

The latest FasterTransformer v5.1.1, which is used by the latest fastertransformer backend release, prescribes that the T5 decoder outputs (output_ids and sequence_length) should be of type int32; however, in the current config.pbtxt they are specified as uint32.

https://github.com/NVIDIA/FasterTransformer/blob/release/v5.1.1_tag/docs/t5_guide.md

Please update the config.pbtxt with the correct output configurations.

Reproduced Steps

Run T5 in ensemble mode and the post-processing logic will indicate that the output should be INT32.

Does FT supports serving multiple models concurrently?

If there are multiple models in the model repository, how will FT launch model instances? Say there are 4 GPUs in total and I launch the BERT and GPT models with one instance each: will they both be placed on the first GPU? Can I control the instance placement policy?

Recommendation for the complete BERT model deployment on Triton + fastertransformer backend

Description

What is the recommended approach for practical complete BERT model deployment on Triton with the faster transformer backend?

So far, I have successfully converted the HF bert-base-uncased model weights using huggingface_bert_convert.py, deployed them to the Triton server with the fastertransformer backend, and completed the BERT identity_test.py. I noticed that the converted model is, in fact, a BERT encoder without the embeddings and without a pooler + classification head on top. After watching the FasterTransformer presentation at https://developer.nvidia.com/gtc/2020/video/s21417, I understand why the optimization focused on the BERT encoder. Still, I am baffled about how to practically use a complete BERT model on Triton with the fastertransformer backend. Resolving the embeddings and invoking the pooler and QA/classification head on top match nicely with the pre/post-processing paradigms. Is that the recommended approach for completing the full inference on, for example, a BERT QA model? If so, do the pre/post-processing components have access to GPUs? Any other recommendations or guidelines?
TIA
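
In case it helps frame the question, below is a rough client-side sketch of the split being described: the HF embeddings run as pre-processing, the FT encoder runs on Triton, and the HF pooler (or a task head) runs as post-processing. The Triton tensor names (input_hidden_state, sequence_lengths, output_hidden_state) are the ones used by identity_test.py; everything else here is an assumption, including that the model was converted and served in fp16.

import numpy as np
import torch
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_model = BertModel.from_pretrained("bert-base-uncased").eval()
client = httpclient.InferenceServerClient("localhost:8000")

def encode(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Pre-processing: token + position + segment embeddings on the client side.
        hidden = hf_model.embeddings(input_ids=enc.input_ids,
                                     token_type_ids=enc.token_type_ids)
    hidden = hidden.half().numpy()                                   # served in fp16
    seq_len = enc.attention_mask.sum(dim=1, keepdim=True).numpy().astype(np.int32)

    inputs = []
    for name, arr in [("input_hidden_state", hidden), ("sequence_lengths", seq_len)]:
        t = httpclient.InferInput(name, arr.shape, np_to_triton_dtype(arr.dtype))
        t.set_data_from_numpy(arr)
        inputs.append(t)

    result = client.infer("fastertransformer", inputs)
    encoder_out = torch.from_numpy(result.as_numpy("output_hidden_state")).float()
    with torch.no_grad():
        # Post-processing: apply the HF pooler (or a QA/classification head) on top.
        return hf_model.pooler(encoder_out)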

How much VRAM does BLOOM consume?

Hi, thanks for supporting the BLOOM model in the latest release of fastertransformer backend.

I tried the latest code on my 8x A6000 GPU server with 48 GB of VRAM per GPU (384 GB in total). After converting the BLOOM model with FasterTransformer/examples/PyTorch/gpt/utility/huggingface_bloom_convert.py using tp=8 and dt=fp16, the resulting checkpoint is about 330 GB.

But after I ran tritonserver, the following out-of-memory error occurred:
what(): [FT][ERROR] CUDA runtime error: out of memory /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:32.

I wonder how much memory BLOOM-176B consumes with the fastertransformer backend. I can run BLOOM inference on my 8x A6000 server with the HuggingFace package, so it seems that the FasterTransformer library allocates more memory than the model itself requires.
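
For a rough sanity check (back-of-the-envelope arithmetic only, not FT's exact allocator behaviour): fp16 weights at tensor parallelism 8 already take roughly 41 GiB per GPU, and FT additionally needs an fp16 K/V cache sized by the batch and total sequence length, plus activations and cuBLAS/NCCL workspaces. Under the assumed batch and sequence length below there is almost no headroom left on a 48 GB card, which would be consistent with the OOM:

# Rough per-GPU memory estimate for BLOOM-176B under tensor parallelism.
# max_batch and max_seq are assumptions; adjust them to your actual requests.
n_params = 176e9
n_layers, n_heads, head_dim = 70, 112, 128     # BLOOM-176B architecture
hidden = n_heads * head_dim                    # 14336
tp = 8
bytes_fp16 = 2

weights_per_gpu = n_params * bytes_fp16 / tp

max_batch, max_seq = 8, 2048
# K and V caches: 2 * layers * batch * seq * hidden * 2 bytes, split across TP ranks.
kv_cache_per_gpu = 2 * n_layers * max_batch * max_seq * hidden * bytes_fp16 / tp

gib = 1024 ** 3
print(f"weights/GPU  ~ {weights_per_gpu / gib:.1f} GiB")   # ~41.0 GiB
print(f"KV cache/GPU ~ {kv_cache_per_gpu / gib:.1f} GiB")  # ~7.7 GiB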

Can't deploy multiple versions of BERT.

Description

I tried to deploy multiple versions of BERT.
For that, I removed "default_model_filename" and "model_checkpoint_path" from config.pbtxt.
But when I started up Triton, I got many warning messages like the following:

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.dense.bias.bin cannot be opened, loading model fails! 

Environments.

fastertransformer_backend: 225b57898b830
FasterTransformer: f59e237c247
Docker version: 20.10.17, build 100c701
NVIDIA Driver Version: 515.65.01
GPU: Tesla T4 x 1

Reproduced Steps

1. Clone the fastertransformer_backend repo.

$ git clone https://github.com/triton-inference-server/fastertransformer_backend.git
$ cd fastertransformer_backend
$ export WORKSPACE=$(pwd)
$ export CONTAINER_VERSION=22.07
$ export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
2. Build the container image.
$ docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
3. Run the container.
$ docker run -it --rm --gpus=all -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
4. Convert Huggingface's BERT (in the container).
# export WORKSPACE=$(pwd)

# sudo apt-get install git-lfs
# git lfs install
# git lfs clone https://huggingface.co/bert-base-uncased # Download model from huggingface
# git clone https://github.com/NVIDIA/FasterTransformer.git # To convert checkpoint
# export PYTHONPATH=${WORKSPACE}/FasterTransformer:${PYTHONPATH}
# python3 FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
        -in_file bert-base-uncased/ \
        -saved_dir ${WORKSPACE}/all_models/bert/fastertransformer/1/ \
        -infer_tensor_para_size 1
5. Modify config.pbtxt (change "tensor/pipeline_para_size" to 1, and remove "default_model_filename" and "model_checkpoint_path").
# sed -i -e 's/string_value: "2"/string_value: "1"/' -e "30d" -e "88,93d" all_models/bert/fastertransformer/config.pbtxt

# git diff all_models/bert/fastertransformer/config.pbtxt 
diff --git a/all_models/bert/fastertransformer/config.pbtxt b/all_models/bert/fastertransformer/config.pbtxt
index e18d66f..3a8ed02 100644
--- a/all_models/bert/fastertransformer/config.pbtxt
+++ b/all_models/bert/fastertransformer/config.pbtxt
@@ -27,7 +27,6 @@
 
 name: "fastertransformer"
 backend: "fastertransformer"
-default_model_filename: "bert"
 max_batch_size: 1024
 input [
   {
@@ -58,13 +57,13 @@ instance_group [
 parameters {
   key: "tensor_para_size"
   value: {
-    string_value: "2"
+    string_value: "1"
   }
 }
 parameters {
   key: "pipeline_para_size"
   value: {
-    string_value: "2"
+    string_value: "1"
   }
 }
 parameters {
@@ -85,12 +84,6 @@ parameters {
     string_value: "bert"
   }
 }
-parameters {
-  key: "model_checkpoint_path"
-  value: {
-    string_value: "../all_models/bert/fastertransformer/1/2-gpu/"
-  }
-}
 parameters {
   key: "int8_mode"
   value: {
6. Start Triton and observe the warnings.
# ls all_models/bert/fastertransformer/
1  config.pbtxt
# ls all_models/bert/fastertransformer/1
1-gpu

# CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/ 
...
[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.LayerNorm.weight.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.attention.self.query.weight.0.bin cannot be opened, loading model fails! 

[FT][WARNING] file /home/uno/fastertransformer_backend/all_models/bert/fastertransformer/1/1-gpumodel.encoder.layer.0.output.dense.bias.bin cannot be opened, loading model fails! 
...

Allow mT5 support alongside T5

Currently mT5 is not fully supported, yielding multiple errors of the type
[ERROR] cannot find key 'encoder.block.0.layer.1.DenseReluDense.wi_0.weight'
when passing an mT5 model to t5_ckpt_convert.py.
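
For context, mT5 follows the T5 v1.1 layout with a gated feed-forward, so its checkpoints contain wi_0/wi_1 projections instead of the single wi of classic T5, which the converter's key handling does not appear to cover. A quick way to confirm the layout (a sketch, assuming the HF google/mt5-small checkpoint):

# List the encoder FFN keys of an mT5 checkpoint to show the gated wi_0/wi_1 layout.
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
ffn_keys = [k for k in model.state_dict()
            if k.startswith("encoder.block.0") and "DenseReluDense" in k]
print("\n".join(ffn_keys))
# encoder.block.0.layer.1.DenseReluDense.wi_0.weight
# encoder.block.0.layer.1.DenseReluDense.wi_1.weight
# encoder.block.0.layer.1.DenseReluDense.wo.weight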

Unexpected behavior of batched inference of GPT-J

Description

Inference outputs without batching do not match outputs with batching.

Such a mismatch occurs when padding is used, i.e.:

  • batch size > 1
  • there are different lengths in the batch
    • so the pad_token is used

Example

Case 1

Input:

["My name is"]

Output:

[" Aleksey"]

Case 2

Input:

["I am from"]

Output:

[" Belarus"]

Case 3

Input:

["My name is", "I am from"]

Output:

[" Alex", " UK"]

Expected behaviour

Case 3 should have the same outputs as in Case 1 and Case 2.

The following outputs should match:

  • single/no-batch inference
  • batched inference

Relevant issues

This issue should be resolved in FasterTransformer: NVIDIA/FasterTransformer#312

Thoughts

Batched inference relies on an attention mask, which is where this behavior should be handled. But there is no way to pass such a mask as an input, nor to tell the model which pad_token is used (only bos=start_id and eos=end_id). I assume that a feature for passing attention_masks at inference time would help a lot.

I also assume that the following part requires changes:

#attn_mask = torch.ones((batch_size, input_len, input_len)).tril()
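
A related client-side detail worth checking (a sketch, under the assumption that FT uses input_lengths to identify padded positions): input_lengths should carry the true, unpadded length of each sequence, which can be taken from the tokenizer's attention mask rather than from the padded input_ids:

# Derive per-sample true lengths from the attention mask (left padding with
# pad_token_id = eos = 50256, matching the reproduction script below).
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token_id = 50256

texts = ["My name is", "I am from"]
enc = tokenizer(texts, return_tensors="np", padding=True, add_special_tokens=False)

input_ids = enc.input_ids.astype(np.uint32)        # cast to whatever dtype config.pbtxt declares
input_lengths = enc.attention_mask.sum(axis=1, keepdims=True).astype(np.uint32)
print(input_ids.shape, input_lengths.ravel())      # true lengths, not padded lengths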

How to reproduce

Docker image

The image is available here: rtalaricw/gptj_ft:v1.2-22.04-new
It was built on 5 October with the main branches of all needed repos.

# Base Image
ARG TRITON_VERSION=22.04
ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-py3
FROM ${BASE_IMAGE} as server-builder

# Get NVIDIA keys to authenticate
RUN export this_distro="$(cat /etc/os-release | grep '^ID=' | awk -F'=' '{print $2}')" \
    && export this_version="$(cat /etc/os-release | grep '^VERSION_ID=' | awk -F'=' '{print $2}' | sed 's/[^0-9]*//g')" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/7fa2af80.pub" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/3bf863cc.pub"

# Run updates and install packages for build
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    openssh-server zsh tmux mosh locales-all clangd sudo \
    zip unzip wget build-essential autoconf autogen gdb \
    python3.8 python3-pip python3-dev rapidjson-dev \
    xz-utils zstd libz-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Setup workdir for build
WORKDIR /workspace/build/

# CMake
RUN CMAKE_VERSION=3.18 && \
    CMAKE_BUILD=3.18.4 && \
    wget -nv https://cmake.org/files/v${CMAKE_VERSION}/cmake-${CMAKE_BUILD}.tar.gz && \
    tar -xf cmake-${CMAKE_BUILD}.tar.gz && \
    cd cmake-${CMAKE_BUILD} && \
    ./bootstrap --parallel=$(grep -c ^processor /proc/cpuinfo) -- -DCMAKE_USE_OPENSSL=OFF && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
    cd /workspace/build/ && \
    rm -rf /workspace/build/cmake-${CMAKE_BUILD}

# backend build
WORKDIR /workspace/build/triton-experiments

RUN echo 2
RUN git clone https://github.com/triton-inference-server/fastertransformer_backend.git
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/cmake /workspace/build/triton-experiments
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/src /workspace/build/triton-experiments
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/CMakeLists.txt /workspace/build/triton-experiments

ARG FORCE_BACKEND_REBUILD=0
RUN mkdir build -p && \
    cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

# =================================
#  Runner Image
# =================================

FROM ${BASE_IMAGE} as server

ENV NCCL_LAUNCH_MODE=PARALLEL

COPY --from=server-builder /opt/tritonserver/backends/fastertransformer /opt/tritonserver/backends/fastertransformer

Config

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "GPT-J"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/mnt/pvc/triton-model-store/fastertransformer/1"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}

Inference to reproduce an issue

/opt/tritonserver/bin/tritonserver --model-repository=YOUR_MODEL_PATH
import time
import numpy as np
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
from datasets import load_dataset

dataset = load_dataset("ChaiML/user_model_inputs")

DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': f'localhost:8000',
    'model_name': 'fastertransformer',
    'verbose': False,
}

dtype = "uint32"

GENERATION_CONFIG = {
    "request": [
        {
            "name": "input_ids",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "input_lengths",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "request_output_len",
            "data": [[64]],
            "dtype": dtype
        },
        {
            "name": "beam_search_diversity_rate",
            "data": [[0]],
            "dtype": "float32"
        },
        {
            "name": "temperature",
            "data": [[0.72]],
            "dtype": "float32"
        },
        {
            "name": "repetition_penalty",
            "data": [[1.13]],
            "dtype": "float32"
        },
        {
            "name": "beam_width",
            "data": [[1]],
            "dtype": dtype
        },
        {
            "name": "random_seed",
            "data": [[0]],
            "dtype": "int32"
        },
        {
            "name": "runtime_top_k",
            "data": [[0]],
            "dtype": dtype
        },
        {
            "name": "runtime_top_p",
            "data": [[0.725]],
            "dtype": "float32"
        },
        {
            "name": "stop_words_list",
            "data": [[[198], [1]]],
            "dtype": "int32"
        },
        {
            "name": "bad_words_list",
            "data": [[[77, 15249, 77], [2, 5, 7]]],
            "dtype": "int32"
        }
    ]
}

padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
tokenizer.pad_token_id = 50256
assert tokenizer.pad_token_id == 50256, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side

def to_word_list_format(words):
    flat_ids = []
    offsets = []
    item_flat_ids = []
    item_offsets = []

    for word in words:
        ids = tokenizer.encode(word)

        if len(ids) == 0:
            continue

        item_flat_ids += ids
        item_offsets.append(len(ids))

    flat_ids.append(np.array(item_flat_ids))
    offsets.append(np.cumsum(np.array(item_offsets)))

    pad_to = max(1, max(len(ids) for ids in flat_ids))

    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
        offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)

    return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))

def load_bad_word_ids():
    forbidden = [
        'samplebadword'
    ]

    return to_word_list_format(forbidden)

GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()

def generate_parameters_from_texts(texts, random_seed=None):
    params = deepcopy(GENERATION_CONFIG["request"])
    inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
    input_ids = inputs.input_ids
    for index, value in enumerate(params):

        if value['name'] == 'input_ids':
            data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
        elif value['name'] == 'input_lengths':
            value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids]
            data = np.array([data for data in value_data], dtype=value['dtype'])
        elif value['name'] == 'random_seed':
            if random_seed is None:
                random_seed = random.randint(0, 10000)
            data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
        else:
            data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])

        params[index] = {
            'name': value['name'],
            'data': data,
        }
    return params

def prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def triton_inference(inference_client, texts, random_seed=None):
    request = generate_parameters_from_texts(texts, random_seed)
    payload = [prepare_tensor(httpclient, field['name'], field['data'])
               for field in request]
    result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
    output_texts = []
    output_texts_cropped = []

    for i, output in enumerate(result.get_response()['outputs']):
        if output['name'] == "output_ids":
            for output_ids in result.as_numpy(output['name']):
                output_ids = [int(output_id) for output_id in list(output_ids[0])]
                output_texts.append(tokenizer.decode(output_ids, skip_special_tokens=True).strip())
                output_texts_cropped.append(
                    tokenizer.decode(
                        output_ids[len(request[0]["data"][i]):], skip_special_tokens=True
                    ).strip()
                )
    return output_texts_cropped

def main():
    client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)

    INPUT_EXAMPLES = dataset["train"]["text"][:2]
    example1 = INPUT_EXAMPLES[0]
    example2 = INPUT_EXAMPLES[1]

    print(
        triton_inference(client, [example1], random_seed=0)
    )

    print(
        triton_inference(client, [example2], random_seed=0)
    )

    print(
        triton_inference(client, [example1, example2], random_seed=0)
    )

if __name__ == "__main__":
    main()

FasterTransformer might freeze after a few requests

Running into an issue where, after sending a few requests in succession, FasterTransformer on Triton will lock up; the logs look like this:

W0406 20:15:51.091033 52 libfastertransformer.cc:1092] the data is in CPU
W0406 20:15:51.091045 52 libfastertransformer.cc:1096] the data is in CPU
W0406 20:15:51.091054 52 libfastertransformer.cc:975] before ThreadForward 0
W0406 20:15:51.091200 52 libfastertransformer.cc:977] after ThreadForward 0
W0406 20:15:51.091220 52 libfastertransformer.cc:975] before ThreadForward 1
W0406 20:15:51.091274 52 libfastertransformer.cc:977] after ThreadForward 1
W0406 20:15:51.091288 52 libfastertransformer.cc:975] before ThreadForward 2
W0406 20:15:51.091357 52 libfastertransformer.cc:977] after ThreadForward 2
W0406 20:15:51.091363 52 libfastertransformer.cc:856] Start to forward
W0406 20:15:51.091366 52 libfastertransformer.cc:856] Start to forward
W0406 20:15:51.091374 52 libfastertransformer.cc:975] before ThreadForward 3
W0406 20:15:51.091407 52 libfastertransformer.cc:856] Start to forward
W0406 20:15:51.091517 52 libfastertransformer.cc:977] after ThreadForward 3
W0406 20:15:51.091583 52 libfastertransformer.cc:856] Start to forward

I've left it in this state for over an hour and it still hangs. Interestingly, some of the GPUs still show 100% GPU utilization in nvidia-smi. It is also flaky, in that it doesn't happen after the same number of requests each time. I am using 4 A100s.

Happy to provide more information as needed.

Can't re-load any T5 model after a first load/unload iteration

Description

Branch: main
GPU: NVIDIA V100S
Docker version: 20.10.16

When re-loading a model that has previously been loaded and unloaded once,
the backend crashes with the following error:

I1019 08:53:38.952721 301 model_repository_manager.cc:997] loading: t5-small:1
*** The MPI_Init_thread() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[cf752fab27ae:00301] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Do you have any idea where this comes from?

Reproduced Steps

1. Start a triton server with `--model-control-mode=explicit`
2. Load a T5 model on the fastertransformer backend (e.g. t5-small): `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
3. Unload the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/unload`
4. Re-load the T5 model `curl -vX POST localhost:8000/v2/repository/models/t5-small/load`
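
The same load/unload/load cycle can be driven from the Python client (a sketch using tritonclient's explicit model-control calls); the second load is where the backend aborts with the MPI_FINALIZE error:

import tritonclient.http as httpclient

# Server started with --model-control-mode=explicit, as in step 1.
client = httpclient.InferenceServerClient("localhost:8000")

client.load_model("t5-small")
assert client.is_model_ready("t5-small")

client.unload_model("t5-small")

client.load_model("t5-small")   # crashes here with the MPI error above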

Failed to run FasterTransformer BERT Triton Backend with multiple instances.

Description

main branch, Docker version 20.10.17, Tesla T4 GPU

Reproduced Steps

Preparation:
I followed the steps in the bert_guide document to set up the docker container and convert the model with --infer_tensor_para_size=1. I also tested it successfully using ${WORKSPACE}/tools/bert/identity_test.py with T=1, P=1.

Step 1: Set instance_group.count=2 and model_checkpoint_path=./all_models/bert/fastertransformer/1/1-gpu/ in config.pbtxt.
Step 2: Launch the container

export WORKSPACE="/home/ubuntu/efs/fastertransformer_backend/"
export TRITON_DOCKER_IMAGE="triton_with_ft:22.07"
export NAME="triton_with_ft"

docker run -it --rm \
           --net=host \
           --gpus=all \
           -v ${WORKSPACE}:${WORKSPACE} \
           -w ${WORKSPACE} \
           -e WORKSPACE=${WORKSPACE} \
           --name ${NAME} \
           ${TRITON_DOCKER_IMAGE} bash

Step 3: Launch the Triton server in the container

CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/

Step 4: Run identity_test.py

python3 ${WORKSPACE}/tools/bert/identity_test.py \
        --hf_ckpt_path ./bert-base-uncased/ \
        --num_runs 100 \
        --inference_data_type fp16

The following error occurred on the client side:

Some weights of the model checkpoint at ./bert-base-uncased/ were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
set request
get results as output_hidden_state

set request
Traceback (most recent call last):
  File "../tools/bert/identity_test.py", line 221, in <module>
    FLAGS.verbose, FLAGS, request_parallelism=2)
  File "../tools/bert/identity_test.py", line 99, in send_requests
    result = client.infer(model_name, inputs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tritonclient/http/__init__.py", line 1418, in infer
    _raise_if_error(response)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tritonclient/http/__init__.py", line 65, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: pinned buffer: failed to perform CUDA copy: invalid argument

The log in the triton server is as follows:

W0819 05:38:13.744919 3278 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0_0 with 1 requests
W0819 05:38:13.744927 3278 libfastertransformer.cc:886] get total batch_size = 8
W0819 05:38:13.744934 3278 libfastertransformer.cc:1296] get input count = 2
W0819 05:38:13.745027 3278 libfastertransformer.cc:1368] collect name: input_hidden_state size: 393216 bytes
W0819 05:38:13.745042 3278 libfastertransformer.cc:1368] collect name: sequence_lengths size: 32 bytes
W0819 05:38:13.745047 3278 libfastertransformer.cc:1379] the data is in CPU
W0819 05:38:13.745052 3278 libfastertransformer.cc:1386] the data is in CPU
W0819 05:38:13.745062 3278 libfastertransformer.cc:1244] before ThreadForward 0
W0819 05:38:13.745104 3278 libfastertransformer.cc:1252] after ThreadForward 0
I0819 05:38:13.745126 3278 libfastertransformer.cc:1090] Start to forward
I0819 05:38:13.746714 3278 libfastertransformer.cc:1098] Stop to forward
W0819 05:38:13.746838 3278 libfastertransformer.cc:1411] Get output_tensors 0: output_hidden_state
W0819 05:38:13.746859 3278 libfastertransformer.cc:1421]     output_type: FP16
W0819 05:38:13.746866 3278 libfastertransformer.cc:1443]     output shape: [8, 32, 768]
W0819 05:38:13.747001 3278 libfastertransformer.cc:1458] PERFORMED GPU copy: NO
W0819 05:38:13.747012 3278 libfastertransformer.cc:1001] get response size = 1
W0819 05:38:13.747046 3278 libfastertransformer.cc:1016] response is sent
W0819 05:38:18.756127 3278 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0_1, executing 1 requests
W0819 05:38:18.756161 3278 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0_1 with 1 requests
W0819 05:38:18.756175 3278 libfastertransformer.cc:886] get total batch_size = 8
W0819 05:38:18.756200 3278 libfastertransformer.cc:1296] get input count = 2
W0819 05:38:18.756355 3278 libfastertransformer.cc:1368] collect name: input_hidden_state size: 393216 bytes
W0819 05:38:18.756376 3278 libfastertransformer.cc:1368] collect name: sequence_lengths size: 32 bytes
W0819 05:38:18.756386 3278 libfastertransformer.cc:1379] the data is in CPU
W0819 05:38:18.756394 3278 libfastertransformer.cc:1386] the data is in CPU
W0819 05:38:18.756428 3278 libfastertransformer.cc:1244] before ThreadForward 1
W0819 05:38:18.756491 3278 libfastertransformer.cc:1252] after ThreadForward 1
I0819 05:38:18.756579 3278 libfastertransformer.cc:1090] Start to forward
I0819 05:38:18.758206 3278 libfastertransformer.cc:1098] Stop to forward
W0819 05:38:18.758268 3278 libfastertransformer.cc:1411] Get output_tensors 0: output_hidden_state
W0819 05:38:18.758285 3278 libfastertransformer.cc:1421]     output_type: FP16
W0819 05:38:18.758296 3278 libfastertransformer.cc:1443]     output shape: [8, 32, 768]
W0819 05:38:18.758463 3278 libfastertransformer.cc:1458] PERFORMED GPU copy: NO
W0819 05:38:18.758477 3278 libfastertransformer.cc:1001] get response size = 1
W0819 05:38:18.758484 3278 libfastertransformer.cc:1019] response is nullptr

Is there any kind of caching?

Hello.

Description

There is a difference in performance between similar requests and different requests.
I use a custom client to do performance testing of the GPT-J 6B model.
When I use only one string (just mocked while prototyping) as the prefix for generation (the same for all requests), I get really good performance:

(below is the output of the wrk performance-testing tool, https://github.com/wg/wrk)
10 threads, prefix length is 512, 50 tokens to be generated.

Latency Distribution
     50%    2.24s 
     75%    2.25s 
     90%    2.25s 
     99%    2.27s 

But if I use different strings as prefixes, it looks like the queue of requests grows quickly, and I get something like:

Latency Distribution
     50%    3.17s 
     75%    3.28s 
     90%    4.12s 
     99%    5.89s 

Environment

I use a custom docker container based on nvcr.io/nvidia/tritonserver:22.03-py3 with fastertransformer_backend built in.
I also use the ensemble setup from here: https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gptj
The model checkpoint was converted and set up as described here: https://developer.nvidia.com/blog/deploying-gpt-j-and-t5-with-fastertransformer-and-triton-inference-server/

Thoughts

I noticed that when I use only one string, I can see more consistent batches in the triton-server logs:

I1013 20:56:40.491431 567 libfastertransformer.cc:886] get total batch_size = 10
I1013 20:56:40.491448 567 libfastertransformer.cc:1296] get input count = 16

But when I feed it different inputs, I see a lot of partial batches of my 10-thread client feed, like:

I1013 20:56:40.491431 567 libfastertransformer.cc:886] get total batch_size = 3
I1013 20:56:40.491448 567 libfastertransformer.cc:1296] get input count = 16
I1013 20:56:40.491431 567 libfastertransformer.cc:886] get total batch_size = 4
I1013 20:56:40.491448 567 libfastertransformer.cc:1296] get input count = 16
I1013 20:56:40.491431 567 libfastertransformer.cc:886] get total batch_size = 1
I1013 20:56:40.491448 567 libfastertransformer.cc:1296] get input count = 16

etc.

Is there any kind of cache in the FasterTransformer framework that could produce this kind of behaviour?
Or could the lower performance on varied inputs be caused by something else?
Thank you in advance!

Segmentation fault: address not mapped to object at address (nil)

Description
When the Triton server with FasterTransformer starts and loads a model, a segmentation fault happens.
This error does not happen when no models are loaded.

Triton Information

  • NVIDIA driver: 510.47.03
  • Triton sever version: 22.03

To Reproduce

1. Overall, follow the "Prepare docker images" procedure to build:
    https://github.com/triton-inference-server/fastertransformer_backend#setup

Before building, I changed the Dockerfile as follows:

# line 22
ARG TRITON_VERSION=22.01 -> 22.03

# before line 26 and line 81 (before apt-get update)
RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
  2. Prepare the GPT-3 model (details omitted; I confirm that the old Triton + FT can load the prepared model without any problems).

  3. Run the built image and run Triton.

The error message is as follows:

[cc3a89043c6d:105  :0:108] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:    108) ====
 0 0x00000000000143c0 __funlockfile()  ???:0
 1 0x000000000001e429 triton::backend::fastertransformer_backend::ModelInstanceState::ModelInstanceState()  :0
 2 0x000000000001fa02 triton::backend::fastertransformer_backend::ModelInstanceState::Create()  :0
 3 0x00000000000222fc TRITONBACKEND_ModelInstanceInitialize()  ???:0
 4 0x000000000030fece nvidia::inferenceserver::TritonModelInstance::CreateInstance()  :0
 5 0x0000000000311493 nvidia::inferenceserver::TritonModelInstance::CreateInstances()  :0
 6 0x0000000000309147 nvidia::inferenceserver::TritonModel::Create()  :0
 7 0x000000000018c37a nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::CreateModel()  :0
 8 0x000000000019a351 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvidia::inferenceserver::Status (nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::ModelInfo*), nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, long, nvidia::inferenceserver::ModelRepositoryManager::ModelLifeCycle::ModelInfo*> > >::_M_run()  :0
 9 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
10 0x0000000000008609 start_thread()  ???:0
11 0x000000000011f163 clone()  ???:0
=================================
[cc3a89043c6d:00105] *** Process received signal ***
[cc3a89043c6d:00105] Signal: Segmentation fault (11)
[cc3a89043c6d:00105] Signal code:  (-6)
[cc3a89043c6d:00105] Failing at address: 0x69
[cc3a89043c6d:00105] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f2b81bb33c0]
[cc3a89043c6d:00105] [ 1] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x1e429)[0x7f2ab01b7429]
[cc3a89043c6d:00105] [ 2] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(+0x1fa02)[0x7f2ab01b8a02]
[cc3a89043c6d:00105] [ 3] /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so(TRITONBACKEND_ModelInstanceInitialize+0x38c)[0x7f2ab01bb2fc]
[cc3a89043c6d:00105] [ 4] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x30fece)[0x7f2b81ee3ece]
[cc3a89043c6d:00105] [ 5] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x311493)[0x7f2b81ee5493]
[cc3a89043c6d:00105] [ 6] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x309147)[0x7f2b81edd147]
[cc3a89043c6d:00105] [ 7] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18c37a)[0x7f2b81d6037a]
[cc3a89043c6d:00105] [ 8] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a351)[0x7f2b81d6e351]
[cc3a89043c6d:00105] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2b8172ade4]
[cc3a89043c6d:00105] [10] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2b81ba7609]
[cc3a89043c6d:00105] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2b81415163]
[cc3a89043c6d:00105] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cc3a89043c6d exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The above error does not happen when no models are loaded.

root@2367819e4209:/workspace# mkdir /a
root@2367819e4209:/workspace# ls /a
root@2367819e4209:/workspace# mpirun -n 1 --allow-run-as-root tritonserver --model-repository=/a
I0510 09:42:32.681141 115 libtorch.cc:1309] TRITONBACKEND_Initialize: pytorch
I0510 09:42:32.681269 115 libtorch.cc:1319] Triton TRITONBACKEND API version: 1.8
I0510 09:42:32.681278 115 libtorch.cc:1325] 'pytorch' TRITONBACKEND API version: 1.8
2022-05-10 09:42:32.850252: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-05-10 09:42:32.889780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0510 09:42:32.889847 115 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0510 09:42:32.889869 115 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0510 09:42:32.889874 115 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0510 09:42:32.889878 115 tensorflow.cc:2216] backend configuration:
{}
I0510 09:42:32.891282 115 onnxruntime.cc:2319] TRITONBACKEND_Initialize: onnxruntime
I0510 09:42:32.891308 115 onnxruntime.cc:2329] Triton TRITONBACKEND API version: 1.8
I0510 09:42:32.891313 115 onnxruntime.cc:2335] 'onnxruntime' TRITONBACKEND API version: 1.8
I0510 09:42:32.891317 115 onnxruntime.cc:2365] backend configuration:
{}
I0510 09:42:32.908946 115 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0510 09:42:32.908968 115 openvino.cc:1217] Triton TRITONBACKEND API version: 1.8
I0510 09:42:32.908973 115 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.8
I0510 09:42:33.286687 115 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f55ee000000' with size 268435456
I0510 09:42:33.287001 115 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0510 09:42:33.287551 115 server.cc:524]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0510 09:42:33.287603 115 server.cc:551]
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0510 09:42:33.287623 115 server.cc:594]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0510 09:42:33.322839 115 metrics.cc:651] Collecting metrics for GPU 0: Tesla V100-PCIE-16GB
I0510 09:42:33.324668 115 tritonserver.cc:1962]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.20.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /a                                                                                                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0510 09:42:33.326424 115 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0510 09:42:33.326672 115 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0510 09:42:33.368908 115 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Expected behavior
Triton + FT can load models successfully.

Multi-instance inference fails in (n-1)/n runs (where n is the number of GPUs/instances)

Hello. Thank you for your work and framework!

My goal is to host N instances of GPT-J 6B on N graphics cards, i.e. N instances with one model each.
My setup is 3x 3090 (another host has 5x 3090, but everything else is similar), a docker container built according to the repository README, and a fastertransformer config that looks like this (for the 3-GPU setup):

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 8

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 3
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "GPT-J"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/workspace/models/j6b_ckpt/1-gpu"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
dynamic_batching {
  max_queue_delay_microseconds: 300000
}

I start Triton using the command:
CUDA_VISIBLE_DEVICES="0,1,2" /opt/tritonserver/bin/tritonserver --http-port 8282 --log-verbose 10 --model-repository=/workspace/triton-models

I face an issue which I can describe as follows:
I see 3 instances running on GPUs 0, 1 and 2. But when I request inference, I get a correct generation result in only 1/3 of cases:

I1026 11:01:20.566547 93 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0_0, executing 1 requests
I1026 11:01:20.566598 93 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0_0 with 1 requests
I1026 11:01:20.566614 93 libfastertransformer.cc:886] get total batch_size = 1
I1026 11:01:20.566632 93 libfastertransformer.cc:1296] get input count = 14
I1026 11:01:20.566660 93 libfastertransformer.cc:1368] collect name: stop_words_list size: 8 bytes
I1026 11:01:20.566683 93 libfastertransformer.cc:1368] collect name: input_lengths size: 4 bytes
I1026 11:01:20.566701 93 libfastertransformer.cc:1368] collect name: request_output_len size: 4 bytes
I1026 11:01:20.566718 93 libfastertransformer.cc:1368] collect name: temperature size: 4 bytes
I1026 11:01:20.566735 93 libfastertransformer.cc:1368] collect name: random_seed size: 4 bytes
I1026 11:01:20.566751 93 libfastertransformer.cc:1368] collect name: bad_words_list size: 8 bytes
I1026 11:01:20.566766 93 libfastertransformer.cc:1368] collect name: runtime_top_k size: 4 bytes
I1026 11:01:20.566782 93 libfastertransformer.cc:1368] collect name: runtime_top_p size: 4 bytes
I1026 11:01:20.566799 93 libfastertransformer.cc:1368] collect name: input_ids size: 2048 bytes
I1026 11:01:20.566815 93 libfastertransformer.cc:1368] collect name: start_id size: 4 bytes
I1026 11:01:20.566831 93 libfastertransformer.cc:1368] collect name: end_id size: 4 bytes
I1026 11:01:20.566845 93 libfastertransformer.cc:1368] collect name: beam_width size: 4 bytes
I1026 11:01:20.566860 93 libfastertransformer.cc:1368] collect name: beam_search_diversity_rate size: 4 bytes
I1026 11:01:20.566876 93 libfastertransformer.cc:1368] collect name: repetition_penalty size: 4 bytes
I1026 11:01:20.566896 93 libfastertransformer.cc:1379] the data is in CPU
I1026 11:01:20.566908 93 libfastertransformer.cc:1386] the data is in CPU
I1026 11:01:20.566945 93 libfastertransformer.cc:1244] before ThreadForward 0
I1026 11:01:20.567101 93 libfastertransformer.cc:1252] after ThreadForward 0
I1026 11:01:20.567169 93 libfastertransformer.cc:1090] Start to forward
I1026 11:01:22.420238 93 libfastertransformer.cc:1098] Stop to forward
I1026 11:01:22.420410 93 libfastertransformer.cc:1411] Get output_tensors 0: output_ids
I1026 11:01:22.420460 93 libfastertransformer.cc:1421]     output_type: UINT32
I1026 11:01:22.420477 93 libfastertransformer.cc:1443]     output shape: [1, 1, 612]
I1026 11:01:22.420506 93 infer_response.cc:166] add response output: output: output_ids, type: UINT32, shape: [1,1,612]
I1026 11:01:22.420539 93 http_server.cc:1068] HTTP: unable to provide 'output_ids' in GPU, will use CPU
I1026 11:01:22.420560 93 http_server.cc:1088] HTTP using buffer for: 'output_ids', size: 2448, addr: 0x7fdf67d28ca0
I1026 11:01:22.420579 93 pinned_memory_manager.cc:161] pinned memory allocation: size 2448, addr 0x7fe3ea000090
I1026 11:01:22.420663 93 libfastertransformer.cc:1411] Get output_tensors 1: sequence_length
I1026 11:01:22.420676 93 libfastertransformer.cc:1421]     output_type: INT32
I1026 11:01:22.420688 93 libfastertransformer.cc:1443]     output shape: [1, 1]
I1026 11:01:22.420703 93 infer_response.cc:166] add response output: output: sequence_length, type: INT32, shape: [1,1]
I1026 11:01:22.420718 93 http_server.cc:1068] HTTP: unable to provide 'sequence_length' in GPU, will use CPU
I1026 11:01:22.420732 93 http_server.cc:1088] HTTP using buffer for: 'sequence_length', size: 4, addr: 0x7fd949e73600
I1026 11:01:22.420745 93 pinned_memory_manager.cc:161] pinned memory allocation: size 4, addr 0x7fe3ea000a30
I1026 11:01:22.420785 93 libfastertransformer.cc:1458] PERFORMED GPU copy: NO
I1026 11:01:22.420797 93 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fe3ea000090
I1026 11:01:22.420810 93 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fe3ea000a30
I1026 11:01:22.420829 93 libfastertransformer.cc:1001] get response size = 1
I1026 11:01:22.420902 93 http_server.cc:1140] HTTP release: size 2448, addr 0x7fdf67d28ca0
I1026 11:01:22.420917 93 http_server.cc:1140] HTTP release: size 4, addr 0x7fd949e73600
I1026 11:01:22.420930 93 libfastertransformer.cc:1016] response is sent

But in 2/3 of cases (those routed not to fastertransformer_0_0 but to fastertransformer_0_1 / fastertransformer_0_2) I get the following:

I1026 11:01:40.172315 93 libfastertransformer.cc:1647] model fastertransformer, instance fastertransformer_0_2, executing 1 requests
I1026 11:01:40.172367 93 libfastertransformer.cc:812] TRITONBACKEND_ModelExecute: Running fastertransformer_0_2 with 1 requests
I1026 11:01:40.172393 93 libfastertransformer.cc:886] get total batch_size = 1
I1026 11:01:40.172417 93 libfastertransformer.cc:1296] get input count = 14
I1026 11:01:40.172446 93 libfastertransformer.cc:1368] collect name: stop_words_list size: 8 bytes
I1026 11:01:40.172473 93 libfastertransformer.cc:1368] collect name: input_lengths size: 4 bytes
I1026 11:01:40.172494 93 libfastertransformer.cc:1368] collect name: request_output_len size: 4 bytes
I1026 11:01:40.172515 93 libfastertransformer.cc:1368] collect name: temperature size: 4 bytes
I1026 11:01:40.172538 93 libfastertransformer.cc:1368] collect name: random_seed size: 4 bytes
I1026 11:01:40.172554 93 libfastertransformer.cc:1368] collect name: bad_words_list size: 8 bytes
I1026 11:01:40.172569 93 libfastertransformer.cc:1368] collect name: runtime_top_k size: 4 bytes
I1026 11:01:40.172584 93 libfastertransformer.cc:1368] collect name: runtime_top_p size: 4 bytes
I1026 11:01:40.172607 93 libfastertransformer.cc:1368] collect name: input_ids size: 2048 bytes
I1026 11:01:40.172624 93 libfastertransformer.cc:1368] collect name: start_id size: 4 bytes
I1026 11:01:40.172639 93 libfastertransformer.cc:1368] collect name: end_id size: 4 bytes
I1026 11:01:40.172659 93 libfastertransformer.cc:1368] collect name: beam_width size: 4 bytes
I1026 11:01:40.172681 93 libfastertransformer.cc:1368] collect name: beam_search_diversity_rate size: 4 bytes
I1026 11:01:40.172703 93 libfastertransformer.cc:1368] collect name: repetition_penalty size: 4 bytes
I1026 11:01:40.172726 93 libfastertransformer.cc:1379] the data is in CPU
I1026 11:01:40.172743 93 libfastertransformer.cc:1386] the data is in CPU
I1026 11:01:40.172785 93 libfastertransformer.cc:1244] before ThreadForward 2
I1026 11:01:40.172948 93 libfastertransformer.cc:1252] after ThreadForward 2
I1026 11:01:40.173059 93 libfastertransformer.cc:1090] Start to forward
I1026 11:01:42.007168 93 libfastertransformer.cc:1098] Stop to forward
I1026 11:01:42.007373 93 libfastertransformer.cc:1411] Get output_tensors 0: output_ids
I1026 11:01:42.007431 93 libfastertransformer.cc:1421]     output_type: UINT32
I1026 11:01:42.007449 93 libfastertransformer.cc:1443]     output shape: [1, 1, 612]
I1026 11:01:42.007468 93 infer_response.cc:166] add response output: output: output_ids, type: UINT32, shape: [1,1,612]
I1026 11:01:42.007500 93 http_server.cc:1068] HTTP: unable to provide 'output_ids' in GPU, will use CPU
I1026 11:01:42.007523 93 http_server.cc:1088] HTTP using buffer for: 'output_ids', size: 2448, addr: 0x7fe2d40a94b0
I1026 11:01:42.007542 93 pinned_memory_manager.cc:161] pinned memory allocation: size 2448, addr 0x7fe3ea000090
I1026 11:01:42.007616 93 http_server.cc:1140] HTTP release: size 2448, addr 0x7fe2d40a94b0
I1026 11:01:42.007637 93 libfastertransformer.cc:1411] Get output_tensors 1: sequence_length
I1026 11:01:42.007648 93 libfastertransformer.cc:1421]     output_type: INT32
I1026 11:01:42.007661 93 libfastertransformer.cc:1443]     output shape: [1, 1]
I1026 11:01:42.007673 93 libfastertransformer.cc:1458] PERFORMED GPU copy: NO
I1026 11:01:42.007686 93 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fe3ea000090
I1026 11:01:42.007701 93 libfastertransformer.cc:1001] get response size = 1
W1026 11:01:42.007713 93 libfastertransformer.cc:1019] response is nullptr

And the Triton HTTP client (the Python one) accordingly reports:
tritonclient.utils.InferenceServerException: pinned buffer: failed to perform CUDA copy: invalid argument

Is there any error in my setup or pipeline? What is the correct way to set up N models on N GPUs?

Thank you in advance!
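
One workaround sketch (my assumption, not an official recommendation): instead of one Triton process hosting three KIND_CPU instances, launch one tritonserver per GPU, each pinned via CUDA_VISIBLE_DEVICES with instance count 1 in config.pbtxt and its own ports, and load-balance across them from the client side:

# Launch one Triton server per GPU so each GPT-J instance owns a whole card.
# Ports are illustrative; the model repository path matches the one above.
import os
import subprocess

procs = []
for gpu in range(3):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    cmd = [
        "/opt/tritonserver/bin/tritonserver",
        f"--http-port={8000 + 10 * gpu}",
        f"--grpc-port={8001 + 10 * gpu}",
        f"--metrics-port={8002 + 10 * gpu}",
        "--model-repository=/workspace/triton-models",
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()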

[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len

Description

Branch: fastertransformer_backend-release-v1.2.1_tag/
Triton with FT container version: 22.07
GPU: V100
Model: huggingface t5-base

Reproduced Steps

After building the docker image, I use the original huggingface t5-base and convert the model:
1. Convert the model:
python  ./build/fastertransformer_backend/build/_deps/repo-ft-src/examples/pytorch/t5/utils/t5_ckpt_convert.py   -o   /workspace/build/fastertransformer_backend/all_models/t5/fastertransformer/1 -i /FT5/t5-base/ -infer_gpu_num 1

2. Start the model:
export CUDA_VISIBLE_DEVICES=6
/workspace/build/fastertransformer_backend/all_models/t5/fastertransformer# mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/workspace/build/fastertransformer_backend/all_models/t5

Then I hit this error:
I1115 06:40:44.582619 7292 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fbbd4000000' with size 268435456
I1115 06:40:44.583982 7292 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1115 06:40:44.595437 7292 model_repository_manager.cc:1206] loading: fastertransformer:1
I1115 06:40:44.685154 7292 libfastertransformer.cc:1478] TRITONBACKEND_Initialize: fastertransformer
I1115 06:40:44.685175 7292 libfastertransformer.cc:1488] Triton TRITONBACKEND API version: 1.10
I1115 06:40:44.685180 7292 libfastertransformer.cc:1494] 'fastertransformer' TRITONBACKEND API version: 1.10
I1115 06:40:44.685213 7292 libfastertransformer.cc:1526] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I1115 06:40:44.686569 7292 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I1115 06:40:44.686588 7292 libfastertransformer.cc:248] Sequence Batching: disabled
[ERROR] Does not find the section encoder with name relative_attention_num_buckets_or_max_pos_seq_len. 
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35452,1],0]
  Exit code:    255
--------------------------------------------------------------------------

If I set relative_attention_num_buckets_or_max_pos_seq_len=32 in config.ini, then I hit this error:
[ERROR] Does not find the section encoder with name weight_data_type.

FasterTransformer freezes on 4 GPUs while running GPT with NCCL_LAUNCH_MODE=GROUP

Description

Branch: main v1.1
Docker: Docker version 20.10.17, build 100c701
GPU: 4x NVIDIA RTX A600

I'm running Triton inference server on a server with 4 GPUs (no pipeline parallelism). Following the GPT guide, I can run inference with tensor parallelism = 2 (so only using 2 of the GPUs). However, if I follow the same steps but instead run with 4 GPUs in tensor parallelism, any single inference I run freezes, similar to #19, even though NCCL_LAUNCH_MODE is set to GROUP or PARALLEL.

The GPUs will also show full utilization (according to nvidia-smi) until I kill the container, potentially even hours later, long after the timeout window. The server doesn't think any requests are in flight at the time for any models though.

logs:

W0706 20:19:47.394019 877 libfastertransformer.cc:999] before ThreadForward 0
W0706 20:19:47.394158 877 libfastertransformer.cc:1006] after ThreadForward 0
W0706 20:19:47.394177 877 libfastertransformer.cc:999] before ThreadForward 1
W0706 20:19:47.394287 877 libfastertransformer.cc:1006] after ThreadForward 1
W0706 20:19:47.394303 877 libfastertransformer.cc:999] before ThreadForward 2
I0706 20:19:47.394317 877 libfastertransformer.cc:834] Start to forward
I0706 20:19:47.394388 877 libfastertransformer.cc:834] Start to forward
W0706 20:19:47.394424 877 libfastertransformer.cc:1006] after ThreadForward 2
W0706 20:19:47.394444 877 libfastertransformer.cc:999] before ThreadForward 3
I0706 20:19:47.394530 877 libfastertransformer.cc:834] Start to forward
W0706 20:19:47.394565 877 libfastertransformer.cc:1006] after ThreadForward 3
I0706 20:19:47.394651 877 libfastertransformer.cc:834] Start to forward

Reproduced Steps

1. docker run -it --rm --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 --shm-size 256m -v ${WORKSPACE}:/workspace ${TRITON_DOCKER_IMAGE} bash
2.  CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=$PWD/fastertransformer_backend/all_models/gpt/ &
3. python3 /workspace/fastertransformer_backend/tools/identity_test.py
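Editor's aside, offered as a diagnostic suggestion rather than a fix: since the hang only appears once all 4 GPUs are used for tensor parallelism, one thing worth ruling out is a GPU peer-to-peer access problem, which NCCL's P2P transport depends on. A minimal sketch, assuming PyTorch is available in the container:

```python
# Diagnostic sketch only: print the GPU peer-access matrix for the visible GPUs.
import torch

n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```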

T5: Triton Model Repository (containing model weights and configuration) on S3 doesn't work as expected

Description

It appears that Triton Server with the FasterTransformer backend doesn't work as expected when loading the model repository from S3 (containing both the configuration and the model weights).

Release: v1.2
GPU: V100
Command used to invoke Triton Server: 
`CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver --model-repository s3://*/users/dhaval-doshi/t5-3b/triton-model-store/t5/ --log-info true`

The invocation fails with the below error:

[ERROR] Can't load '/tmp/folderXaegJB/1/t5/config.ini'
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR]  Assertion fail: /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/triton_backend/t5/T5TritonModel.cc:91 

The model repository structure on S3 is as follows:

S3://*/triton-model-store/t5/fastertransformer/config.pbtxt
S3://*/triton-model-store/t5/fastertransformer/1/<weight files and config.ini files>

The above structure is in line with how the model repository is created in this repo for T5, at this path:
all_models/t5/fastertransformer/…

Details:

It looks like when you start Triton server with an S3 path as the model repository, it downloads the contents into the Docker container at startup, into a temp folder:

/tmp/folderXaegJB/

ls /tmp/folderXaegJB/
1  config.pbtxt

Which is basically the contents of the S3 model repository directory s3://*/triton-model-store/t5/fastertransformer.

However, when Triton tries to construct model_checkpoint_path to pass to FT for loading T5, using the line of code below:

{RepositoryPath(), std::to_string(Version()), model_filename})

It basically constructs the path below, which of course doesn't exist:

/tmp/folderXaegJB/1/t5/config.ini

Hence there is an inconsistency between how the model repository is expected to be structured and how it is downloaded and resolved from S3.

I cannot explicitly pass model_checkpoint_path because Triton downloads all of this from S3 into a temp folder, and I don't know beforehand which temp folder it will be.
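For illustration only (using no values beyond those visible in the logs above), a minimal Python sketch of the path the backend ends up constructing versus what the downloaded S3 repository actually contains:

```python
import os

# Values taken from the logs above; the temp folder name is random per run.
repository_path = "/tmp/folderXaegJB"   # where Triton unpacked the S3 repository
version = "1"
model_filename = "t5"                    # model file name visible in the failing path

# Python equivalent of {RepositoryPath(), std::to_string(Version()), model_filename}
checkpoint_path = os.path.join(repository_path, version, model_filename)
print(os.path.join(checkpoint_path, "config.ini"))  # -> /tmp/folderXaegJB/1/t5/config.ini

# What the download actually provides, per the ls output above:
#   /tmp/folderXaegJB/config.pbtxt
#   /tmp/folderXaegJB/1/<weight files and config.ini>
# i.e. config.ini sits directly under 1/, not under 1/t5/.
```

Under these assumptions, the constructed path would only exist if the weights were uploaded one directory deeper on S3 (fastertransformer/1/t5/), which is exactly the inconsistency this report describes.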

Note: It also appears that the FasterTransformer backend's model repository structure is different from the model repository guidance provided here:
https://github.com/triton-inference-server/server/blob/e9ef15b0fc06d45ceca28861c98b31d0e7f9ee79/docs/user_guide/model_repository.md

The FasterTransformer backend and ensemble models also expect you to put files under fastertransformer/<version>/weights and fastertransformer/config.pbtxt.

Please help investigate this issue.



Reproduced Steps

1. Upload the model weights and model repository to the S3 bucket. You can copy the model repository in this repo and upload the 1-gpu weights (after running the conversion script) inside the 1/ folder.
2. Run Triton with the command shown in the description above.

T5 cross_attention output cannot be accessed

Description

As defined in the FasterTransformer T5 guide, there is an output value for cross_attentions. I cannot find any way of returning cross_attentions from the FasterTransformer Triton backend for T5.

For reference:

  • Output of T5 Decoding
| Name | Tensor/Parameter Shape | Location | Data Type | Description |
| --- | --- | --- | --- | --- |
| output_ids | [batch_size, beam_width, max_output_seq_len] | GPU | int | The output ids. It contains the input_ids and generated ids |
| sequence_length | [batch_size, beam_width] | GPU | int | The lengths of output ids |
| output_log_probs | [batch_size, beam_width, request_output_seq_len] | GPU | float | Optional. It records the log probability of logits at each step for sampling. |
| cum_log_probs | [batch_size, beam_width] | GPU | float | Optional. Cumulative log probability of generated sentences |
| cross_attentions | [num_layer / pipeline_para_size, batch_size, beam_width, head_num / tensor_para_size, max_seq_len, mem_max_seq_len] | GPU | float | Optional. The attention scores of cross attention |

t5_guide.md shows 0 BLEU score

Description

https://github.com/triton-inference-server/fastertransformer_backend/tree/main/docs
I followed the guidelines in "t5_guide.md", but I got a BLEU score of 0.
Could you please help with this issue? thank you!

Reproduced Steps

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION} 
docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .
docker run -it --rm --gpus=all -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# now in docker
export WORKSPACE=$(pwd)

git lfs clone https://huggingface.co/t5-small
git clone https://github.com/NVIDIA/FasterTransformer.git # To convert checkpoint
python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -in_file t5-small/ \
        -saved_dir ${WORKSPACE}/all_models/t5/fastertransformer/1/ \
        -inference_tensor_para_size 1

git clone https://github.com/triton-inference-server/fastertransformer_backend
cp fastertransformer_backend/all_models/t5/fastertransformer/config.pbtxt \
/workspace/all_models/t5/fastertransformer

# changed the value of "model_checkpoint_path" in config.pbtxt to "./all_models/t5/fastertransformer/1/1-gpu"

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/t5/ &
python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32

I got the results below
get request
       bleu score:   0.00
       bleu counts: [2, 0, 0, 0]
       bleu totals: [372090, 369086, 366082, 363078]
       bleu precisions: [0.0005375043672229836, 0.00013546978211040244, 6.829071082435082e-05, 3.44278639851492e-05]
       bleu sys_len: 372090; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 33.08 sec to translate 372090 tokens, BLEU score: 0.00, 11250 tokens/sec.
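Editor's note: the reported sys_len (372090) is roughly six times ref_len (61287), which suggests degenerate or runaway output rather than a scoring problem. One hedged way to narrow this down is to check what the original checkpoint produces for a single sentence; if the Hugging Face baseline translates correctly but the Triton output does not, the converted weights, the model_checkpoint_path, or the data_type in config.pbtxt are the first things to re-check. A minimal sketch, assuming the transformers and sentencepiece packages are available in the container:

```python
# Generate a reference translation with the original Hugging Face t5-small so
# the FT/Triton output can be compared against a known-good baseline.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "translate English to German: The house is wonderful."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```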

Pipeline parallelism does not work for FasterTransformer BERT Triton Backend.

Description

main branch, Docker version 20.10.17, Tesla T4 GPU

Reproduced Steps

Preparation:
I follow the steps in the bert_guide document to setup the docker container and convert the model with --infer_tensor_para_size=1. Also I tested it using ${WORKSPACE}/tools/bert/identity_test.py with T=1, P=1 successfully.

Step 1: Set pipeline_para_size=2 and model_checkpoint_path=./all_models/bert/fastertransformer/1/1-gpu/ in config.pbtxt.

Step 2: Launch the container

export WORKSPACE="/home/ubuntu/efs/fastertransformer_backend/"
export TRITON_DOCKER_IMAGE="triton_with_ft:22.07"
export NAME="triton_with_ft"

docker run -it --rm \
           --net=host \
           --gpus=all \
           -v ${WORKSPACE}:${WORKSPACE} \
           -w ${WORKSPACE} \
           -e WORKSPACE=${WORKSPACE} \
           --name ${NAME} \
           ${TRITON_DOCKER_IMAGE} bash

Step 3: Launch the Triton server in the container

CUDA_VISIBLE_DEVICES=0,1 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver  --model-repository=${WORKSPACE}/all_models/bert/

Then the error occurs as follows:

I0819 02:57:50.658966 5836 libfastertransformer.cc:218] Instance group type: KIND_CPU count: 1
I0819 02:57:50.658975 5836 libfastertransformer.cc:248] Sequence Batching: disabled
I0819 02:57:50.660078 5836 libfastertransformer.cc:420] Before Loading Weights:
after allocation    : free: 14.37 GB, total: 14.56 GB, used:  0.19 GB
I0819 02:57:51.427084 5836 libfastertransformer.cc:430] After Loading Weights:
after allocation    : free: 14.13 GB, total: 14.56 GB, used:  0.43 GB
W0819 02:57:51.427218 5836 libfastertransformer.cc:478] skipping model configuration auto-complete for 'fastertransformer': not supported for fastertransformer backend
W0819 02:57:51.431853 5836 libfastertransformer.cc:651] Model name bert
W0819 02:57:51.431868 5836 libfastertransformer.cc:661] Use COUPLED (classic) API.
W0819 02:57:51.431876 5836 libfastertransformer.cc:756] Get input name: input_hidden_state, type: TYPE_FP16, shape: [-1, -1]
W0819 02:57:51.431881 5836 libfastertransformer.cc:756] Get input name: sequence_lengths, type: TYPE_INT32, shape: [1]
W0819 02:57:51.431903 5836 libfastertransformer.cc:798] Get output name: output_hidden_state, type: TYPE_FP16, shape: [-1, -1]
[ip-172-31-56-220:05836] *** Process received signal ***
[ip-172-31-56-220:5836 :0:5944] Caught signal 7 (Bus error: nonexistent physical address)
[ip-172-31-56-220:05836] Signal: Bus error (7)
[ip-172-31-56-220:05836] Signal code: Non-existant physical address (2)
[ip-172-31-56-220:05836] Failing at address: 0x7f843b6cf000
[ip-172-31-56-220:05836] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f848694d420]
[ip-172-31-56-220:05836] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f8485483b41]
[ip-172-31-56-220:05836] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6929c)[0x7f83c555729c]
[ip-172-31-56-220:05836] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b9ae)[0x7f83c55599ae]
[ip-172-31-56-220:05836] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x50853)[0x7f83c553e853]
[ip-172-31-56-220:05836] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x417b4)[0x7f83c552f7b4]
[ip-172-31-56-220:05836] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x42c4d)[0x7f83c5530c4d]
[ip-172-31-56-220:05836] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x58b37)[0x7f83c5546b37]
[ip-172-31-56-220:05836] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f8486941609]
[ip-172-31-56-220:05836] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f8485417133]
[ip-172-31-56-220:05836] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-56-220 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
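Editor's aside, offered as a hedged observation rather than a diagnosis: the docker run command in this report does not pass --shm-size, and NCCL bus errors of this kind are commonly associated with an undersized /dev/shm inside the container (other examples in this document launch the container with --shm-size 256m or --shm-size=4G). The sketch below simply reports how much shared memory the container actually has:

```python
# Diagnostic sketch only: report the shared-memory size inside the container.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, "
      f"used: {used / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")
```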

Can you share data.json to run perf_analyzer?

Description

I have written it as below.

{
    "data":
    [
        {
            "input_ids" : [9915, 27221, 59, 77, 383, 1853, 3327, 1462],
            "input_lengths" : [8],
            "request_output_len" : [128],
            "beam_search_diversity_rate" : [0],
            "temperature" : [1.0],
            "len_penalty": [1.0],
            "repetition_penalty": [1.0],
            "random_seed": [0],
            "is_return_log_probs": [true],
            "beam_width": [1],
            "runtime_top_k": [1.0],
            "runtime_top_p": [1.0],
            "start_id": [0],
            "end_id": [1],
            "bad_words_list": [[0], [-1]],
            "stop_words_list": [[32094], [0]]
        }
    ]
}

but it returns the error: failed to create concurrency manager: unable to find int32_t data in json.
If you could share a working data.json, it would be very helpful.
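Editor's note: the error says perf_analyzer could not find int32_t data, and the one value in the file above that is written as a float but is usually integer-typed in the GPT config.pbtxt is runtime_top_k (1.0). A hedged sketch that regenerates the file with that value as an integer; which of the remaining fields are integer- or float-typed should be checked against your own config.pbtxt rather than taken from this example.

```python
# Minimal sketch: rewrite data.json so integer-typed inputs contain integer
# literals. The only change from the original file is runtime_top_k (1.0 -> 1).
import json

entry = {
    "input_ids": [9915, 27221, 59, 77, 383, 1853, 3327, 1462],
    "input_lengths": [8],
    "request_output_len": [128],
    "beam_search_diversity_rate": [0.0],
    "temperature": [1.0],
    "len_penalty": [1.0],
    "repetition_penalty": [1.0],
    "random_seed": [0],
    "is_return_log_probs": [True],
    "beam_width": [1],
    "runtime_top_k": [1],      # integer, assuming TYPE_UINT32/TYPE_INT32 in config.pbtxt
    "runtime_top_p": [1.0],
    "start_id": [0],
    "end_id": [1],
    "bad_words_list": [[0], [-1]],
    "stop_words_list": [[32094], [0]],
}

with open("data.json", "w") as f:
    json.dump({"data": [entry]}, f, indent=4)
```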

Not getting response with warning "response is nullptr"

Description

The problem: with "dynamic_batching" enabled, Triton inference server sometimes doesn't respond properly and logging "response is nullptr" several times, and sometimes crash.

The model is a pretty standard BERT model, downloaded from here, and then converted using huggingface_bert_convert.py script. The model is deployed with Triton inference server with ft_backend enabled.

In config.pbtxt, the input sequence length is modified to a fixed value 384, and the hidden state dim to 768 (according to bert_model.config)

perf_analyzer is used for benchmarking with custom generated data. This issue happens when "dynamic_batching {}" was added to config.pbtxt and concurrency >= 3 (with some randomness).

Another observation is that the server may crash if is_remove_padding equals to 1, with following output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/triton_backend/bert/BertTritonModelInstance.cc:106

Signal (6) received.
 0# 0x000055D1DC9BF459 in /opt/tritonserver/bin/tritonserver
 1# 0x00007F816A83A090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F816ABF3911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F816ABFF38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F816ABFF3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F816ABFF6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# 0x00007F8160043BA2 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
11# 0x00007F816AC2BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F816BFA3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

If is_remove_padding equals 0, the server doesn't crash and only outputs multiple lines of "response is nullptr". perf_analyzer logs the following output before it quits.

Request concurrency: 5
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [1] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [2] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [4] had error: pinned buffer: failed to perform CUDA copy: invalid argument

System info

  • GPU: GeForce RTX 2080Ti

  • GPU Driver: 525.60.11

  • Docker: 19.03.13

  • nvidia-container-runtime:

    • runc version 1.0.0-rc6+dev
    • commit: 96ec2177ae841256168fcf76954f7177af9446eb
    • spec: 1.0.1-dev
  • nvidia-container-cli:

    • version: 1.3.0
    • build date: 2020-09-16T12:35+0000
    • build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
  • branch: main

  • commit: 69a397e

  • docker base image: nvcr.io/nvidia/tritonserver:22.12-py3

Reproduced Steps

To prepare the model, download the BERT model and convert it with the following commands.

# assume the model was in /workspace/models/
mv /workspace/models/{bert_,}config.json
sed -i 's/}/  "bert_type":"bert"\n}/' /workspace/models/config.json

python3 /workspace/FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
 -infer_tensor_para_size 1 \
 -in_file /workspace/models/ \
 -saved_dir /workspace/models_out

mkdir -p triton-model-store/bert-fp32/
mv /workspace/models_out triton-model-store/bert-fp32/1

cat <<EOF > triton-model-store/bert-fp32/config.pbtxt
name: "bert-fp32"
backend: "fastertransformer"
default_model_filename: "bert"
max_batch_size: 1024
input [
  {
    name: "input_hidden_state"
    dims: [ 384, 768 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output_hidden_state"
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp32"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "bert"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "triton-model-store/bert-fp32/1/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_sparse"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_remove_padding"
  value: {
    string_value: "0"
  }
}
dynamic_batching {
}
EOF

Build the Docker image:

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .

Run the server

docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus=all --shm-size=4G  -v $(pwd):/ft_workspace --network host --name nv_ft triton_with_ft:22.12 bash

cd /ft_workspace

CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver --model-repository=/ft_workspace/triton-model-store

This python script was used to generate random test data:

import numpy as np
data = np.random.random([384,768]).astype(np.float32)  # np.float16
data.tofile('input_hidden_state')
L = np.asarray([384]).astype(np.int32)
L.tofile('sequence_lengths')

The generated data was put in the mockdata folder. Then run perf_analyzer:

/path/to/perf_analyzer -m bert-fp32 --input-data ./mockdata/ --concurrency-range 1:10
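Editor's note: before going through perf_analyzer, a single hand-built request can confirm whether the server answers at all for this model. A minimal sketch, assuming the tritonclient[http] package is installed, that input_hidden_state is FP32 (matching the data-generation script above), and that the server listens on localhost:8000:

```python
# Single-request sanity check against the bert-fp32 model defined above.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shapes include the batch dimension because max_batch_size > 0 in config.pbtxt.
hidden = np.random.random([1, 384, 768]).astype(np.float32)
lengths = np.array([[384]], dtype=np.int32)

inputs = [
    httpclient.InferInput("input_hidden_state", list(hidden.shape), "FP32"),
    httpclient.InferInput("sequence_lengths", list(lengths.shape), "INT32"),
]
inputs[0].set_data_from_numpy(hidden)
inputs[1].set_data_from_numpy(lengths)

result = client.infer("bert-fp32", inputs)
print(result.as_numpy("output_hidden_state").shape)
```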
