
tensorrtllm_backend's Introduction

Triton Inference Server

License

Warning

LATEST RELEASE

You are currently on the main branch which tracks under-development progress towards the next release. The current release is version 2.46.0 and corresponds to the 24.05 container release on NVIDIA GPU Cloud (NGC).

Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and Arm CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensembles and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

Major features include:

New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!

Join the Triton and TensorRT community and stay current on the latest product updates, bug fixes, content, best practices, and more. Need enterprise support? NVIDIA global support is available for Triton Inference Server with the NVIDIA AI Enterprise software suite.

Serve a Model in 3 Easy Steps

# Step 1: Create the example model repository
git clone -b r24.05 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.05-py3 tritonserver --model-repository=/models

# Step 3: Sending an Inference Request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.05-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT

Please read the QuickStart guide for additional information regarding this example. The quickstart guide also contains an example of how to launch Triton on CPU-only systems. New to Triton and wondering where to get started? Watch the Getting Started video.

Examples and Tutorials

Check out NVIDIA LaunchPad for free access to a set of hands-on labs with Triton Inference Server hosted on NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM are located in the NVIDIA Deep Learning Examples page on GitHub. The NVIDIA Developer Zone contains additional documentation, presentations, and examples.

Documentation

Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker images.

Using Triton

Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or more models into a model repository. Depending on the type of the model and on what Triton capabilities you want to enable for the model, you may need to create a model configuration for the model.
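For reference, a minimal repository layout might look like the following (the model name and file are placeholders; the exact contents depend on the framework and backend, and whether a config.pbtxt is required):

model_repository/
└── densenet_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx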

Configure and Use Triton Inference Server

Client Support and Examples

A Triton client application sends inference and other requests to Triton. The Python and C++ client libraries provide APIs to simplify this communication.
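As a minimal, hedged illustration of the Python HTTP client (the model name and tensor names below are placeholders and must match your model's config.pbtxt):

import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton instance (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_model", "INPUT0" and "OUTPUT0" are placeholders.
inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))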

Extend Triton

Triton Inference Server's architecture is specifically designed for modularity and flexibility.

Additional Documentation

Contributing

Contributions to Triton Inference Server are more than welcome. To contribute please review the contribution guidelines. If you have a backend, client, example or similar contribution that is not modifying the core of Triton, then you should file a PR in the contrib repo.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project. When posting issues in GitHub, follow the process outlined in the Stack Overflow document. Ensure posted examples are:

  • minimal – use as little code as possible that still produces the same problem
  • complete – provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing problems the more time we have to fix it
  • verifiable – test the code you're about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community GitHub Discussions.

For more information

Please refer to the NVIDIA Developer Triton page for more information.

tensorrtllm_backend's People

Contributors

byshiue, kaiyux, rmccorm4, shixiaowei02


tensorrtllm_backend's Issues

Option 3. Build via Docker

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

This feels like it could be a problem with my machine environment?

One more question: when will Triton 23.10 be released?

Error when Building the TensorRT-LLM Backend with Option 3.

Hi,

I ran into an error when building the TensorRT-LLM backend with Option 3 at step [trt_llm_builder 4/4]. Any idea what the problem could be?

...

 => [trt_llm_builder 3/4] COPY tensorrt_llm tensorrt_llm                   1.7s
 => ERROR [trt_llm_builder 4/4] RUN cd tensorrt_llm && python3 scripts  3747.2s
------
 > [trt_llm_builder 4/4] RUN cd tensorrt_llm && python3 scripts/build_wheel.py --trt_root="/usr/local/tensorrt" -i -c && cd ..:
#22 0.962 Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

...

#22 3746.9 [100%] Linking CXX shared library libtensorrt_llm.so
#22 3747.1 /usr/bin/ld:/app/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a: file format not recognized; treating as linker script
#22 3747.1 /usr/bin/ld:/app/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a:1: syntax error
#22 3747.1 collect2: error: ld returned 1 exit status
#22 3747.1 gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:714: tensorrt_llm/libtensorrt_llm.so] Error 1
#22 3747.1 gmake[2]: *** [CMakeFiles/Makefile2:677: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
#22 3747.1 gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
#22 3747.1 gmake: *** [Makefile:179: tensorrt_llm] Error 2
#22 3747.1 Traceback (most recent call last):
#22 3747.1   File "/app/tensorrt_llm/scripts/build_wheel.py", line 248, in <module>
#22 3747.1     main(**vars(args))
#22 3747.1   File "/app/tensorrt_llm/scripts/build_wheel.py", line 152, in main
#22 3747.1     build_run(
#22 3747.1   File "/usr/lib/python3.10/subprocess.py", line 526, in run
#22 3747.1     raise CalledProcessError(retcode, process.args,
#22 3747.1 subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 24 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.
------
executor failed running [/bin/sh -c cd tensorrt_llm && python3 scripts/build_wheel.py --trt_root="${TRT_ROOT}" -i -c && cd ..]: exit code: 1

How to load llama. I tried, but I got an error

The model is vicuna_13b. I used build.py to generate a 2-GPU model.

E1025 09:17:53.664847 126 backend_model.cc:553] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1025 09:17:53.664952 126 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1025 09:17:53.664979 126 model_lifecycle.cc:757] failed to load 'tensorrt_llm'
E1025 09:17:53.665119 126 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.;
I1025 09:17:53.665245 126 server.cc:604]

run tritonserver failed

I built the tritonserver Docker image with the tensorrtllm backend and launched tritonserver following
https://github.com/triton-inference-server/tensorrtllm_backend#launch-triton-server-within-ngc-container

I get this error (model: chatglm2):

+----------------+---------+--------------------------------------------------------------------------------------------------------------------+
| Model          | Version | Status                                                                                                             |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1       | READY                                                                                                              |
| preprocessing  | 1       | READY                                                                                                              |
| tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed:  |
|                |         | mpiSize == tp * pp (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:80)                                 |
|                |         | 1       0x7f923a86a645 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x17645) [0x7f923a86a645]  |
|                |         | 2       0x7f923a87748d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x2448d) [0x7f923a87748d]  |
|                |         | 3       0x7f923a8a9722 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56722) [0x7f923a8a9722]  |
|                |         | 4       0x7f923a8a4335 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x51335) [0x7f923a8a4335]  |
|                |         | 5       0x7f923a8a221b /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f21b) [0x7f923a8a221b]  |
|                |         | 6       0x7f923a885ec2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x32ec2) [0x7f923a885ec2]  |
|                |         | 7       0x7f923a885f75 TRITONBACKEND_ModelInstanceInitialize + 101                                                 |
|                |         | 8       0x7f93641a4116 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a0116) [0x7f93641a4116]                 |
|                |         | 9       0x7f93641a5356 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1356) [0x7f93641a5356]                 |
|                |         | 10      0x7f9364189bd5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x185bd5) [0x7f9364189bd5]                 |
|                |         | 11      0x7f936418a216 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x186216) [0x7f936418a216]                 |
|                |         | 12      0x7f936419531d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19131d) [0x7f936419531d]                 |
|                |         | 13      0x7f9363807f68 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99f68) [0x7f9363807f68]                              |
|                |         | 14      0x7f9364181adb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17dadb) [0x7f9364181adb]                 |
|                |         | 15      0x7f936418f865 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b865) [0x7f936418f865]                 |
|                |         | 16      0x7f9364194682 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190682) [0x7f9364194682]                 |
|                |         | 17      0x7f9364277230 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x273230) [0x7f9364277230]                 |
|                |         | 18      0x7f936427a923 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x276923) [0x7f936427a923]                 |
|                |         | 19      0x7f93643c3e52 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3bfe52) [0x7f93643c3e52]                 |
|                |         | 20      0x7f9363a72253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9363a72253]                         |
|                |         | 21      0x7f9363802b43 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9363802b43]                              |
|                |         | 22      0x7f9363893bb4 clone + 68

How to correctly obtain streaming output from Triton using gRPC?

I am using the Triton 23.08 image. I can get a correct response from the HTTP client, as well as a complete response from the gRPC client. However, I would like to know whether it is possible to obtain streaming output with both decoupled mode and inflight_fused_batching enabled at the same time. I used my own gRPC streaming client and encountered the following error in Java:

io.grpc.StatusRuntimeException: UNIMPLEMENTED: ModelInfer RPC doesn't support models with decoupled transaction policy
	at io.grpc.Status.asRuntimeException(Status.java:539)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Alternatively, could you provide an example of how to correctly obtain a streaming response in Java?
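The error above comes from calling the unary ModelInfer RPC; decoupled models have to be consumed through the bidirectional streaming RPC (ModelStreamInfer), which is what the Java streaming stub and the Python client's stream API wrap. A minimal Python sketch for orientation; the input names, shapes and any additional required inputs are assumptions and must match your ensemble's config.pbtxt:

import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Each streamed (partial) response from the decoupled model arrives here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Assumed ensemble inputs; adjust names/shapes and add any other required
# inputs (e.g. bad_words/stop_words) per your config.pbtxt.
text = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text.set_data_from_numpy(np.array([["What is machine learning?"]], dtype=object))
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[20]], dtype=np.int32))

client.start_stream(callback=callback)
client.async_stream_infer(model_name="ensemble", inputs=[text, max_tokens])
print(responses.get(timeout=60))  # first streamed response (or an error)
client.stop_stream()

In Java, the analogous approach is typically to use the generated bidirectional streaming stub for ModelStreamInfer rather than the unary ModelInfer call.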

Git Clone recursive failed

I just ran:

git clone https://github.com/triton-inference-server/tensorrtllm_backend --recursive

Error message

fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:NVIDIA/TensorRT-LLM.git' into submodule path '/home/ubuntu/test/tensorrtllm_backend/tensorrt_llm' failed
Failed to clone 'tensorrt_llm'. Retry scheduled
Cloning into '/home/ubuntu/test/tensorrtllm_backend/tensorrt_llm'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Unable to load TensorRT LLM

I tried this using the new Triton Inference Server (trtllm) 23.10 image and the official instructions here: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/. I keep getting the same error when loading the model:

Using 523 tokens in paged KV cache.
E1028 18:34:33.926976 4434 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq)

I followed the instructions exactly, and my CUDA version is 12.2. Any help in debugging this would be appreciated, thanks.

</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s> It's unnecessary for me

curl --location 'ip:8000/v2/models/ensemble/generate'
--header 'Content-Type: application/json'
--data '{"text_input": "根据主题:无人机
\n对下列段落进行详细的撰写:技术价值", "max_tokens": 500, "bad_words": "", "stop_words": ""}'

result:
{
"model_name": "ensemble",
"model_version": "1",
"sequence_end": false,
"sequence_id": 0,
"sequence_start": false,
"text_output": " 根据主题:无人机
\n对下列段落进行详细的撰写:技术价值与无人机应用\n\n无人机(Unmanned Aerial Vehicle,简称UAV)是一种无需人类驾驶员操作即可自主飞行的航空器。自20世纪50年代以来,无人机技术已经取得了显著的进步,如今已经成为军事、民用和科研领域的重要工具。无人机技术的价值主要体现在以下几个方面:\n\n1. 提高作战效率:无人机可以在危险的环境中执行侦查、监视和打击任务,降低人员伤亡风险。此外,无人机可以24小时不间断地执行任务,提高情报收集和处理的效率。\n\n2. 降低成本:无人机可以替代部分有人驾驶飞机,降低军事行动的成本。同时,无人机在民用领域的应用也可以降低运输、物流和监测等方面的费用。\n\n3. 提高监测精度:无人机搭载的高分辨率相机和传感器可以实时传输高清图像和数据,提高对地面目标的识别和定位精度。这对于环境监测、灾害救援、城市规划等领域具有重要的应用价值。\n\n4. 拓展科研领域:无人机技术的发展为各种科研任务提供了新的可能性。例如,无人机可以在极端环境下进行地质勘探、生物研究等任务,提高科研工作的效率和准确性。\n\n5. 促进创新:无人机技术的快速发展推动了相关领域的技术创新,如通信、导航、控制等技术的发展。这些技术的进步为无人机在各个领域的应用提供了更多的可能性。\n\n总之,无人机技术在军事、民用和科研领域具有广泛的应用前景。随着技术的不断进步,无人机将在未来发挥更大的作用,为人类带来更多的便利和价值。09年12月25日\n2009年12月25日 星期五 晴 今天是圣诞节,是西方的传统节日,也是基督教徒庆祝耶稣基督诞生的日子。在这个节日里,人们会互赠礼物、举行庆祝活动,还会与家人朋友共度美好时光。\n\n在这个特殊的日子里,我决定去逛一逛商场,看看有没有什么有趣的东西可以买。在商场里,我看到了许多圣诞主题的装饰和礼品,有圣诞树、圣诞老人、圣诞帽、圣诞礼物盒等等。这些装饰和礼品都充满了节日的气氛,让人感受到了浓浓的喜庆氛围。\n\n我还看到了许多商家在促销活动中,有的商家在打折销售商品,有的商家在举办抽奖活动,还有的商家在提供优惠券。这些促销活动吸引了众多顾客,商场里人潮涌动,热闹非凡。\n\n在商场里,我还看到了许多家庭"
}

Part of the result is not helpful to me and is unnecessary. For example, everything from "09年12月25日" onward drifts into an unrelated diary entry about Christmas shopping that has nothing to do with the prompt.

I would like this extra content not to be generated. How should I handle this?
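One thing to experiment with (a hedged sketch, not a definitive fix) is passing non-empty stop_words in the generate request so decoding halts when one of them is produced; whether a given stop string takes effect depends on how it tokenizes. Reusing the endpoint and fields shown above:

import requests

# "ip" is a placeholder for the server address, as in the curl example above.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 500,
    "bad_words": "",
    # A list of candidate stop strings; a single string may also be accepted
    # depending on the model configuration.
    "stop_words": ["\n\n\n", "</s>"],
}
resp = requests.post("http://ip:8000/v2/models/ensemble/generate", json=payload)
print(resp.json().get("text_output"))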

Ensemble definition not being saved to TRITON_GCS_MOUNT_DIRECTORY

When launching the container with a remote GCS model repository (--model-repository=gs://...), Triton does not save the artifacts to the path specified in the TRITON_GCS_MOUNT_DIRECTORY environment variable, as implied by https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_repository.html#google-cloud-storage

docker exec'ing into the pod shows that the files are still going to a /tmp/folderXXXX. I notice that they actually go to two separate folders (likely due to my model being compiled with tensor parallelism = 2)

I have attempted to just set TRITON_GCS_MOUNT_DIRECTORY == "/home/" to ensure that it is a pre-existing directory.

Full docker run command:

docker run --rm \
--net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 \
--gpus 0,1 \
--privileged \
--env TRITON_GCS_MOUNT_DIRECTORY="/home/" \
triton_trt_llm \
mpirun --allow-run-as-root  -n 1 /opt/tritonserver/bin/tritonserver --log-verbose=1 --model-repository=gs://gcs-path --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ :  -n 1 /opt/tritonserver/bin/tritonserver --log-verbose=1 --model-repository=gs://gcs-path --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix1_ : 

The models still load and run, but this prevents me from storing model weights in the same bucket as my ensemble definition, because I cannot know the temporary directory that the "gpt_model_path" parameter should point to.

Does Triton support streaming responses over HTTP?

Hello, I set things up following the decoupled example, and the entire service is already running.

If I want to stream the response, is gRPC the only option, or can the HTTP API stream the response as well? Thank you~
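For what it's worth, newer Triton releases also expose an HTTP streaming variant of the generate endpoint (/v2/models/<model>/generate_stream, Server-Sent Events) for decoupled models; whether it exists depends on the server version you are running. A minimal sketch, assuming the endpoint is available:

import json
import requests

payload = {"text_input": "What is machine learning?", "max_tokens": 20,
           "bad_words": "", "stop_words": ""}

# Each Server-Sent Event line prefixed with "data:" carries one JSON chunk.
with requests.post("http://localhost:8000/v2/models/ensemble/generate_stream",
                   json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line and line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk.get("text_output", ""), end="", flush=True)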

How do I process a batch?

I expected two results, e.g. "hello" and "你好", but I only get one result. I do not know how to process a batch.
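A hedged sketch of one common pattern: send each prompt as its own request and let the server's in-flight batcher combine them, rather than packing several prompts into a single request. The endpoint and fields follow the generate example used elsewhere in this document:

import concurrent.futures
import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"  # placeholder address

def generate(prompt):
    payload = {"text_input": prompt, "max_tokens": 20,
               "bad_words": "", "stop_words": ""}
    return requests.post(URL, json=payload).json().get("text_output")

prompts = ["hello", "你好"]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, output in zip(prompts, pool.map(generate, prompts)):
        print(prompt, "->", output)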

Unload Model using REST API not releasing GPU memory

Hi,

I used the --model-control-mode=explicit option to start the Triton server without loading any models.

mpirun --allow-run-as-root  -n 1 /opt/tritonserver/bin/tritonserver --model-control-mode=explicit --model-repository=/tensorrtllm_backend/triton_model_repo --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_ : 
I1026 10:41:05.243525 673 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1026 10:41:05.243592 673 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1026 10:41:05.243606 673 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1026 10:41:05.359233 673 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f3032000000' with size 268435456
I1026 10:41:05.359548 673 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1026 10:41:05.360189 673 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1026 10:41:05.360231 673 server.cc:631] 
+---------+---------------------------------------------------------+--------+
| Backend | Path                                                    | Config |
+---------+---------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {}     |
+---------+---------------------------------------------------------+--------+

I1026 10:41:05.360247 673 server.cc:674] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I1026 10:41:05.371781 673 metrics.cc:810] Collecting metrics for GPU 0: NVIDIA A10G
I1026 10:41:05.372040 673 metrics.cc:703] Collecting CPU metrics
I1026 10:41:05.372199 673 tritonserver.cc:2435] 
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                        |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                       |
| server_version                   | 2.37.0                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shar |
|                                  | ed_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging                                          |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                                                                       |
| model_control_mode               | MODE_EXPLICIT                                                                                                                |
| strict_model_config              | 1                                                                                                                            |
| rate_limit                       | OFF                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                     |
| min_supported_compute_capability | 6.0                                                                                                                          |
| strict_readiness                 | 1                                                                                                                            |
| exit_timeout                     | 30                                                                                                                           |
| cache_enabled                    | 0                                                                                                                            |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------+

I1026 10:41:05.373521 673 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1026 10:41:05.373754 673 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1026 10:41:05.414772 673 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
Thu Oct 26 10:41:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   23C    P0    55W / 300W |    314MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     25537      C   ...onserver/bin/tritonserver      312MiB |
+-----------------------------------------------------------------------------+

Then, using curl -X POST localhost:8000/v2/repository/models/ensemble/load, I am able to load all the models into memory and run inference with:
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir tensorrt_llm/examples/gpt/gpt2 --streaming

I1026 10:46:56.945699 673 model_lifecycle.cc:462] loading: preprocessing:1
I1026 10:46:56.945809 673 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1026 10:46:56.945936 673 model_lifecycle.cc:462] loading: postprocessing:1
I1026 10:46:56.950418 673 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1026 10:46:56.962964 673 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 512
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 775 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 967, GPU 1110 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 968, GPU 1120 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +774, now: CPU 0, GPU 774 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 977, GPU 3128 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 977, GPU 3136 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 774 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 979, GPU 3144 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 979, GPU 3154 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 774 (MiB)
I1026 10:46:57.609433 673 model_lifecycle.cc:819] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] Using 175902 tokens in paged KV cache.
I1026 10:46:57.723072 673 model_lifecycle.cc:819] successfully loaded 'tensorrt_llm'
I1026 10:46:58.891710 673 model_lifecycle.cc:819] successfully loaded 'preprocessing'
I1026 10:46:58.892012 673 model_lifecycle.cc:462] loading: ensemble:1
I1026 10:46:58.892221 673 model_lifecycle.cc:819] successfully loaded 'ensemble'

Memory usage at this point:

Thu Oct 26 10:47:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   24C    P0    55W / 300W |  19846MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     25537      C   ...onserver/bin/tritonserver    19844MiB |
+-----------------------------------------------------------------------------+

Using the commands below, I unloaded all the models:

curl -X POST localhost:8000/v2/repository/models/tensorrt_llm/unload
curl -X POST localhost:8000/v2/repository/models/preprocessing/unload
curl -X POST localhost:8000/v2/repository/models/postprocessing/unload
curl -X POST localhost:8000/v2/repository/models/ensemble/unload
E1026 10:48:35.437792 673 model_repository_manager.cc:563] Invalid argument: ensemble ensemble contains models that are not available or ambiguous: tensorrt_llm
I1026 10:48:35.437886 673 model_lifecycle.cc:604] successfully unloaded 'ensemble' version 1
I1026 10:48:35.450508 673 model_lifecycle.cc:604] successfully unloaded 'tensorrt_llm' version 1
E1026 10:48:35.451997 673 model_repository_manager.cc:563] Invalid argument: ensemble ensemble contains models that are not available or ambiguous: preprocessing
E1026 10:48:35.464925 673 model_repository_manager.cc:563] Invalid argument: ensemble ensemble contains models that are not available or ambiguous: preprocessing
Cleaning up...
Cleaning up...
I1026 10:48:36.725288 673 model_lifecycle.cc:604] successfully unloaded 'postprocessing' version 1
I1026 10:48:36.982615 673 model_lifecycle.cc:604] successfully unloaded 'preprocessing' version 1

response of curl -X POST localhost:8000/v2/repository/index | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   320  100   320    0     0   175k      0 --:--:-- --:--:-- --:--:--  312k
[
    {
        "name": "ensemble",
        "reason": "unloaded",
        "state": "UNAVAILABLE",
        "version": "1"
    },
    {
        "name": "postprocessing",
        "reason": "unloaded",
        "state": "UNAVAILABLE",
        "version": "1"
    },
    {
        "name": "preprocessing",
        "reason": "unloaded",
        "state": "UNAVAILABLE",
        "version": "1"
    },
    {
        "name": "tensorrt_llm",
        "reason": "unloaded",
        "state": "UNAVAILABLE",
        "version": "1"
    }
]

After unloading the models, nvidia-smi still shows nearly the same memory allocation; unloading the models doesn't free the GPU memory.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   24C    P0    55W / 300W |  19038MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     25537      C   ...onserver/bin/tritonserver    19036MiB |
+-----------------------------------------------------------------------------+
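For reference, the same load/unload calls can also be issued from the Python client; this is only a sketch of the API equivalent of the curl commands above, not a fix for the memory not being returned:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Equivalent of the curl /load and /unload calls above. unload_dependents=True
# also unloads the models an ensemble is composed of.
client.load_model("ensemble")
client.unload_model("ensemble", unload_dependents=True)
print(client.get_model_repository_index())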

Git submodule issue

When using the git submodule update --init --recursive command, I kept receiving the fatal: Could not read from remote repository. error.

I changed .gitmodules to the following to make it work. Not sure whether this is worth doing in the repo to avoid the error for other users, but I thought it was worth mentioning.

[submodule "tensorrt_llm"]
	path = tensorrt_llm
	url = https://github.com/NVIDIA/TensorRT-LLM.git

Triton Server fails to load with error: "key 'tokens_per_block' not found"

Hello I am having issues when using the existing docker image:
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

I create the model repo as per the instructions here

I am using the Model: meta-llama/Llama-2-7b-hf
I have converted the model into TRT engines and have tested it using the run.py provided in tensorrtllm_backend/tensorrt_llm/examples/llama/run.py

I mount the model repo with the TRT model and try to launch the triton server using the provided script:
python3 launch_triton_server.py --world_size=8 --model_repo=/opt/tritonserver/triton_model_repo/

The server launch fails with the following error:

| tensorrt_llm  | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'tokens_per_block' not found |

stop_words does not work with CodeLlama?

I am using the CodeLlama-7b model with the ensemble from "https://github.com/triton-inference-server/tensorrtllm_backend/tree/e514b4af5ec87477b095d3ba6fe63cc7b797055f/all_models/inflight_batcher_llm".

The prompt is "def quickSort" and the stop_words are "quickSort,quickSort(right)". I print the result of "_to_word_list_format":

stop_words = self._to_word_list_format(stop_words_dict)
self.logger.log_info(f"================== preprocessing execute stop_words: {stop_words}")

I get the following result:
model.py:137] ================== preprocessing execute stop_words: [[[ 1 4996 13685 1 4996 13685 29898 1266 29897] [ 3 9 -1 -1 -1 -1 -1 -1 -1]]]

However, the code completion result is:
def quickSort(arr):\n if len(arr) \u003c= 1:\n return arr\n else:\n pivot = arr[0]\n left = []\n right = []\n for i in arr:\n if i \u003c pivot:\n left.append(i))\n else:\n right.append(i))\n return quickSort(left)) + [pivot]) + quickSort(right))\n\n\n\n\n\n\n

As we can see, inference does not stop early when it hits "quickSort" or "quickSort(right)", which is odd. Am I using it incorrectly?

Thank you.
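For reference, the [1, 2, N] layout printed above (concatenated token ids on the first row, cumulative end offsets padded with -1 on the second) can be reproduced with a small sketch like the one below. The tokenizer path is a placeholder, and this mirrors, rather than reuses, the backend's own _to_word_list_format:

import numpy as np
from transformers import AutoTokenizer

def to_word_list_format(words, tokenizer):
    # Row 0: concatenated token ids of all stop words.
    # Row 1: cumulative end offset of each word, padded with -1.
    ids, offsets = [], []
    for word in words:
        tokens = tokenizer.encode(word)
        ids.extend(tokens)
        offsets.append(len(ids))
    offsets += [-1] * (len(ids) - len(offsets))
    return np.array([[ids, offsets]], dtype=np.int32)

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")  # placeholder
print(to_word_list_format(["quickSort", "quickSort(right)"], tokenizer))

This can help check whether the stop strings tokenize to the ids you expect; leading spaces and special tokens such as BOS often change the result.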

How to pass a parameter through an ensemble model?

The normal pipeline for tensorrtllm_backend is preprocessing -> tensorrt_llm -> postprocessing. How can a custom parameter from the request, such as the request token length, be passed along?

In my understanding, the tensorrt_llm backend only performs the inference, so adding extra input and output parameters to it won't work. The question, then, is how to pass a parameter from the preprocessing module to the postprocessing module within the ensemble pipeline.

Is there any way to solve this?
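Ensemble steps only exchange tensors, so one commonly used (hedged) approach is to declare the extra value as a tensor in each step's config.pbtxt and copy it through the intermediate Python models. A minimal sketch of such a pass-through inside a Python backend model.py; the tensor name is an assumption, and triton_python_backend_utils is only importable inside the Python backend:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    """Copies a hypothetical "request_token_length" input straight to an
    output so an ensemble can route it from preprocessing to postprocessing."""

    def execute(self, requests):
        responses = []
        for request in requests:
            value = pb_utils.get_input_tensor_by_name(request, "request_token_length")
            out = pb_utils.Tensor("request_token_length", value.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses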

OOM for a 7B model with int4 quantization

Environment: V100S GPU (32 GB), tritonserver version 23.10

| tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error |
|                |         |  in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmB |
|                |         | uffers.h:112)

There is no branch/tag called r23.10

Triton 23.10 was released several days ago. I tried to build the TensorRT-LLM backend via the build.py script, but the script can't find branch r23.10 in this repository.
Here is the log:

+ mkdir -p /tmp/tritonbuild
+ cd /tmp/tritonbuild
+ git clone --single-branch --depth=1 -b r23.10 https://github.com/triton-inference-server/tensorrtllm_backend tensorrtllm
Cloning into 'tensorrtllm'...
warning: Could not find remote branch r23.10 to clone.
fatal: Remote branch r23.10 not found in upstream origin
error: build failed

Llama 2 request receives 400

  • Triton startup log:
I1024 04:13:54.117054 169 server.cc:674]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+

I1024 04:13:54.419887 169 metrics.cc:810] Collecting metrics for GPU 0: NVIDIA A30
I1024 04:13:54.419924 169 metrics.cc:810] Collecting metrics for GPU 1: NVIDIA A30
I1024 04:13:54.419933 169 metrics.cc:810] Collecting metrics for GPU 2: NVIDIA A30
I1024 04:13:54.419940 169 metrics.cc:810] Collecting metrics for GPU 3: NVIDIA A30
I1024 04:13:54.419947 169 metrics.cc:810] Collecting metrics for GPU 4: NVIDIA A30
I1024 04:13:54.419956 169 metrics.cc:810] Collecting metrics for GPU 5: NVIDIA A30
I1024 04:13:54.419961 169 metrics.cc:810] Collecting metrics for GPU 6: NVIDIA A30
I1024 04:13:54.419969 169 metrics.cc:810] Collecting metrics for GPU 7: NVIDIA A30
I1024 04:13:54.420429 169 metrics.cc:703] Collecting CPU metrics
I1024 04:13:54.420878 169 tritonserver.cc:2435]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                      |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                     |
| server_version                   | 2.37.0                                                                                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_dat |
|                                  | a parameters statistics trace logging                                                                                                                                      |
| model_repository_path[0]         | /app/tensorrtllm_backend/triton_model_repo                                                                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                                                                                  |
| strict_model_config              | 1                                                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{4}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{5}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{6}    | 67108864                                                                                                                                                                   |
| cuda_memory_pool_byte_size{7}    | 67108864                                                                                                                                                                   |
| min_supported_compute_capability | 6.0                                                                                                                                                                        |
| strict_readiness                 | 1                                                                                                                                                                          |
| exit_timeout                     | 30                                                                                                                                                                         |
| cache_enabled                    | 0                                                                                                                                                                          |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1024 04:13:54.421874 169 grpc_server.cc:2345]
+----------------------------------------------+---------+
| GRPC KeepAlive Option                        | Value   |
+----------------------------------------------+---------+
| keepalive_time_ms                            | 7200000 |
| keepalive_timeout_ms                         | 20000   |
| keepalive_permit_without_calls               | 0       |
| http2_max_pings_without_data                 | 2       |
| http2_min_recv_ping_interval_without_data_ms | 300000  |
| http2_max_ping_strikes                       | 2       |
+----------------------------------------------+---------+

I1024 04:13:54.422821 169 grpc_server.cc:101] Ready for RPC 'Check', 0
I1024 04:13:54.422852 169 grpc_server.cc:101] Ready for RPC 'ServerLive', 0
I1024 04:13:54.422861 169 grpc_server.cc:101] Ready for RPC 'ServerReady', 0
I1024 04:13:54.422868 169 grpc_server.cc:101] Ready for RPC 'ModelReady', 0
I1024 04:13:54.422878 169 grpc_server.cc:101] Ready for RPC 'ServerMetadata', 0
I1024 04:13:54.422886 169 grpc_server.cc:101] Ready for RPC 'ModelMetadata', 0
I1024 04:13:54.422903 169 grpc_server.cc:101] Ready for RPC 'ModelConfig', 0
I1024 04:13:54.422911 169 grpc_server.cc:101] Ready for RPC 'SystemSharedMemoryStatus', 0
I1024 04:13:54.422920 169 grpc_server.cc:101] Ready for RPC 'SystemSharedMemoryRegister', 0
I1024 04:13:54.422928 169 grpc_server.cc:101] Ready for RPC 'SystemSharedMemoryUnregister', 0
I1024 04:13:54.422937 169 grpc_server.cc:101] Ready for RPC 'CudaSharedMemoryStatus', 0
I1024 04:13:54.422944 169 grpc_server.cc:101] Ready for RPC 'CudaSharedMemoryRegister', 0
I1024 04:13:54.422951 169 grpc_server.cc:101] Ready for RPC 'CudaSharedMemoryUnregister', 0
I1024 04:13:54.422959 169 grpc_server.cc:101] Ready for RPC 'RepositoryIndex', 0
I1024 04:13:54.422967 169 grpc_server.cc:101] Ready for RPC 'RepositoryModelLoad', 0
I1024 04:13:54.422973 169 grpc_server.cc:101] Ready for RPC 'RepositoryModelUnload', 0
I1024 04:13:54.422981 169 grpc_server.cc:101] Ready for RPC 'ModelStatistics', 0
I1024 04:13:54.422989 169 grpc_server.cc:101] Ready for RPC 'Trace', 0
I1024 04:13:54.422996 169 grpc_server.cc:101] Ready for RPC 'Logging', 0
I1024 04:13:54.423012 169 grpc_server.cc:350] Thread started for CommonHandler
I1024 04:13:54.423347 169 infer_handler.cc:703] New request handler for ModelInferHandler, 0
I1024 04:13:54.423366 169 infer_handler.h:1048] Thread started for ModelInferHandler
I1024 04:13:54.423673 169 infer_handler.cc:703] New request handler for ModelInferHandler, 0
I1024 04:13:54.423691 169 infer_handler.h:1048] Thread started for ModelInferHandler
I1024 04:13:54.424013 169 stream_infer_handler.cc:128] New request handler for ModelStreamInferHandler, 0
I1024 04:13:54.424030 169 infer_handler.h:1048] Thread started for ModelStreamInferHandler
I1024 04:13:54.424038 169 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I1024 04:13:54.424314 169 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I1024 04:13:54.467520 169 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I1024 04:15:33.276895 169 http_server.cc:3452] HTTP request: 2 /v2/models/ensemble/versions/1/generate
I1024 04:15:33.276981 169 http_server.cc:3538] HTTP error: 2 /v2/models/ensemble/versions/1/generate - 400
  • Curl Response
root@f1915ba209b6:/app# curl -v -H "Content-Type: application/json" -X POST localhost:8000/v2/models/ensemble/versions/1/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> POST /v2/models/ensemble/versions/1/generate HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 96
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
root@f1915ba209b6:/app#
  • Build engine command:
python build.py --model_dir /app/meta-llama_Llama-2-70b-chat-hf \
                --dtype float16 \
                --parallel_build \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /app/tensorrt_llm_engines/llama2/70B/trt_engines/fp16/8-gpu/ \
                --world_size 8 \
                --tp_size 8
  • Startup triton command
python3 scripts/launch_triton_server.py --world_size=8 --model_repo=/app/tensorrtllm_backend/triton_model_repo
  • My ensemble config.pbtxt is copied from all_models/inflight_batcher_llm/*

how to write the config.pbtxt file

If I want to deploy a model other than the four models in the example, how should I define the names and dimensions of the inputs and outputs in the config.pbtxt file?

Why is padding_side left for llama tokenizer?

In the example pre/postprocessing code:

elif tokenizer_type == 'llama':
    self.tokenizer = LlamaTokenizer.from_pretrained(
        tokenizer_dir, legacy=False, padding_side='left')

the padding_side is set to 'left' for llama (and other models). Is there a reason for this?

Padding_side is right for llama on HF: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/tokenizer_config.json#L25

Failed, NCCL error 'internal error - please report this issue to the NCCL developers'

The following command reported an error (screenshot attached):

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=./all_models/inflight_batcher_llm/

Model conversion

python examples/baichuan/build.py --model_version v2_13b --max_output_len=1024 --model_dir ./models/Baichuan2-13B-Chat/ --dtype float16 --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --remove_input_padding --use_inflight_batching --paged_kv_cache --output_dir ./models/tmp/baichuan_v2_13b/trt_engines/fp16/2-gpu/

Adapted the all_models/inflight_batcher_llm files:

all_models/inflight_batcher_llm/postprocessing/1/model.py L65:
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, use_fast=False, trust_remote_code=True)
all_models/inflight_batcher_llm/preprocessing/1/model.py L69:
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, use_fast=False, trust_remote_code=True)

docker run
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v $(pwd)/all_models:/app/all_models -v $(pwd)/models:/models triton_trt_llm bash

Failed to run `python3 scripts/launch_triton_server.py --model_repo all_models/inflight_batcher_llm --world_size 2`

This is the error message:

E1102 06:27:15.205534 2025 model_lifecycle.cc:622] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal

And here are the files:

all_models/inflight_batcher_llm
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocessing
│   ├── 1
│   │   ├── model.py
│   └── config.pbtxt
├── preprocessing
│   ├── 1
│   │   ├── model.py
│   └── config.pbtxt
└── tensorrt_llm
    ├── 1
    │   ├── config.json
    │   ├── gpt_float16_tp2_rank0.engine
    │   └── gpt_float16_tp2_rank1.engine
    └── config.pbtxt

How to find out why the results are inconsistent with vllm

The model I'm using is vicuna-13b.
I tested more than 100 cases, and about 10% of the inference results on the TensorRT-LLM backend differ from the results from vLLM. How can I find out the specific reason?

All the parameters used during vLLM inference are below, but I don't know how to pass them to TensorRT-LLM.
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.1, top_p=0.35, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True
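One hedged way to narrow this down is a side-by-side harness: fix the sampling parameters (temperature, top_p, seed) the same on both servers, replay the same prompts, and diff the outputs; note that exact token-level agreement is not guaranteed even then. A sketch follows, with the vLLM query left as a stub since how it is served varies:

import requests

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"  # placeholder

def query_triton(prompt, max_tokens=512):
    payload = {"text_input": prompt, "max_tokens": max_tokens,
               "bad_words": "", "stop_words": ""}
    return requests.post(TRITON_URL, json=payload).json().get("text_output", "")

def query_vllm(prompt, max_tokens=512):
    # Stub: fill in with however your vLLM deployment is queried.
    raise NotImplementedError

prompts = ["What is machine learning?", "def quickSort"]
for prompt in prompts:
    a, b = query_triton(prompt), query_vllm(prompt)
    if a.strip() != b.strip():
        print("MISMATCH:", repr(prompt))
        print("  tensorrt_llm:", a[:120])
        print("  vllm:        ", b[:120])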

CUDA runtime error during stress testing

Inference fails when using ab with a concurrency of 3 or more, but works with a concurrency of 1 or 2. I am using an A10G GPU with driver version 545.23.06, CUDA version 12.3, TensorRT version 9.1, and vicuna-13b-1.5-16k. Is there any workaround?

[TensorRT-LLM][WARNING] Step function failed, continuing.
2023-11-01 10:20:21,882 PID:111 INFO totally input 7 tokens
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f2d2e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f2d2e81e045]
2 0x7f2d2e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f2d2e87fa8a]
3 0x7f2d2e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f2d2e84c821]
4 0x7f2d2e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f2d2e8515c7]
5 0x7f2d2e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f2d2e83b241]
6 0x7f2d2e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f2d2e83c38a]
7 0x7f2d92e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2d92e64253]
8 0x7f2d92bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2d92bf4ac3]
9 0x7f2d92c85bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 760313751: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f2d2e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f2d2e81e045]
2 0x7f2d2e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f2d2e87fa8a]
3 0x7f2d2e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f2d2e84c821]
4 0x7f2d2e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f2d2e8515c7]
5 0x7f2d2e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f2d2e83b241]
6 0x7f2d2e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f2d2e83c38a]
7 0x7f2d92e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2d92e64253]
8 0x7f2d92bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2d92bf4ac3]
9 0x7f2d92c85bf4 clone + 68
[TensorRT-LLM][WARNING] Step function failed, continuing.

If 'decoupled_mode' is set to True (streaming enabled), 'inflight_batcher_llm_client.py' only outputs one token

"When I follow the ’Llama tutorial‘ and run tensorrtllm triton serve llama, if 'decoupled_mode' is set to False (streaming disabled), there are no issues with the output of 'inflight_batcher_llm_client.py'. However, if 'decoupled_mode' is set to True (streaming enabled), 'inflight_batcher_llm_client.py' only outputs one token."

'decoupled_mode' set to False: (screenshot of the full output)

'decoupled_mode' set to True: (screenshot showing only one token)
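
For context, this may be working as intended on the server side: in decoupled (streaming) mode the backend sends one response per generated token, so the client has to accumulate the pieces rather than expect a single complete response. A rough sketch of that accumulation with the gRPC streaming client is below; the tensor names follow the config shown later on this page, and the timeout-based end-of-stream detection is a simplification, not how the repo's client does it.

# Sketch: accumulate per-token responses from the tensorrt_llm model in decoupled mode.
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def make_input(name, values, dtype):
    arr = np.array(values, dtype=dtype)
    tensor = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    tensor.set_data_from_numpy(arr)
    return tensor

def on_response(q, result, error):
    q.put(error if error is not None else result)

inputs = [
    make_input("input_ids", [[1, 15043]], np.int32),   # placeholder token ids
    make_input("input_lengths", [[2]], np.int32),
    make_input("request_output_len", [[64]], np.uint32),
    make_input("streaming", [[True]], np.bool_),
]

responses = queue.Queue()
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(on_response, responses))
client.async_stream_infer("tensorrt_llm", inputs, request_id="1")

generated = []
while True:
    try:
        item = responses.get(timeout=10)  # crude end-of-stream detection, good enough for a sketch
    except queue.Empty:
        break
    if isinstance(item, Exception):
        print(item)
        break
    generated.extend(item.as_numpy("output_ids").flatten().tolist())

client.stop_stream()
print(generated)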

How can I stop my model's generation? Stop words or any other way?

Hi there,
I loaded CodeLlama in tritonserver and passed the stop words just like with the fastertransformer backend before, but TensorRT-LLM did not stop. Some information is in this issue: NVIDIA/TensorRT-LLM#90
I added some logging and changed some code, and it looks like the part that fails to stop the model is in the inflight batching module. Am I right? Is there any other way to stop the model? I think the model should stop generating when the EOS token comes up, rather than relying on stop words.
Thanks.
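
In case it helps, two things seem relevant here. Generation should already stop at EOS if the request's end_id is set to the tokenizer's EOS token (typically 2 for Llama-family tokenizers such as CodeLlama). Separately, if your config.pbtxt exposes a stop_words_list input, my understanding is that it uses the same two-row word-list layout as the fastertransformer backend: row 0 holds the concatenated token ids of all stop sequences, row 1 the cumulative end offsets, padded with -1. Treat that layout as an assumption and check it against your config; a sketch of the encoding:

# Sketch of the FasterTransformer-style word-list encoding assumed for stop_words_list.
# stop_token_ids is a list of token-id lists, one per stop sequence, produced by the
# same tokenizer that tokenizes the prompt.
import numpy as np

def to_word_list_format(stop_token_ids):
    flat, offsets = [], []
    for ids in stop_token_ids:
        flat.extend(ids)
        offsets.append(len(flat))
    pad = max(len(flat), len(offsets))
    flat += [0] * (pad - len(flat))
    offsets += [-1] * (pad - len(offsets))
    # shape [1, 2, pad]: batch dim, then (token-ids row, offsets row)
    return np.array([[flat, offsets]], dtype=np.int32)

print(to_word_list_format([[2], [13, 29889]]))  # e.g. "</s>" plus a two-token stop sequence

The resulting [1, 2, N] array is then sent as the stop_words_list input alongside the other request tensors.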

Is multi-node supported in Triton Inference Server?

Is multi-node supported in Triton Inference Server?

I built llama-7b for tensorrtllm_backend and launched Triton Inference Server.
I have 4 GPUs, but Triton Inference Server loads only 1 of them.

nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

build (llama2)

python build.py --model_dir ${model_directory} \
                --dtype float16 \
                --use_gpt_attention_plugin bfloat16 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --use_gemm_plugin float16 \
                --output_dir engines/fp16/1-gpu/

run

tritonserver --model-repo=/tensorrtllm_backend/triton_model_repo --disable-auto-complete-config
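
As far as I can tell this is expected given the build command above: the engine was built for a single GPU (no tensor parallelism, 1-gpu output directory), so Triton only ever uses one GPU. To spread the model across 4 GPUs the engine has to be rebuilt with tensor parallelism and Triton launched with a matching world size; a sketch is below, with the caveat that the exact flag names can differ between TensorRT-LLM versions.

python build.py --model_dir ${model_directory} \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --use_inflight_batching \
                --paged_kv_cache \
                --remove_input_padding \
                --world_size 4 \
                --tp_size 4 \
                --output_dir engines/fp16/4-gpu/

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo

gpt_model_path in tensorrt_llm/config.pbtxt then has to point at the 4-gpu engine directory. Multi-GPU on a single node goes through mpirun (which launch_triton_server.py wraps); whether multi-node serving is officially supported in this release I can't say.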

How to get more than one inference result with one request?

In vLLM, the "N" parameter controls how many inference results we get per request.

For example, when "N" is 2, we get results like the following:

"Choices": [
      {
        "FinishReason": "stop",
        "Index": 0,
        "Logprobs": {
          "TextOffset": [],
          "TokenLogprobs": [],
          "Tokens": []
        },
        "Text": "(arr, left, right):"
      },
      {
        "FinishReason": "stop",
        "Index": 1,
        "Logprobs": {
          "TextOffset": [],
          "TokenLogprobs": [],
          "Tokens": []
        },
        "Text": "(arr, low, high):"
      }
    ]

In Triton, how do I set parameters to achieve the same effect as above?

Thanks.
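
The closest equivalent I am aware of in the TensorRT-LLM backend is beam search: build the engine with a beam width greater than 1 (the llama example exposes a --max_beam_width flag, though names may vary by version), raise the max_beam_width parameter in tensorrt_llm/config.pbtxt, and pass beam_width in the request; output_ids then contains one hypothesis per beam. To my knowledge there is no direct single-request equivalent of vLLM's independently sampled N candidates. A hedged sketch:

# Sketch: request two beams so output_ids returns two hypotheses per request.
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

def make_input(name, values, dtype):
    arr = np.array(values, dtype=dtype)
    tensor = httpclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    tensor.set_data_from_numpy(arr)
    return tensor

inputs = [
    make_input("input_ids", [[1, 15043]], np.int32),   # placeholder token ids
    make_input("input_lengths", [[2]], np.int32),
    make_input("request_output_len", [[64]], np.uint32),
    make_input("beam_width", [[2]], np.uint32),         # ~ "N" = 2, via beam search
]

client = httpclient.InferenceServerClient("localhost:8000")
result = client.infer("tensorrt_llm", inputs)
out = result.as_numpy("output_ids")                     # typically [batch, beam_width, seq_len]
print(out.shape)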

libtensorrt_llm_batch_manager_static.pre_cxx11.a: file format not recognized; treating as linker script


When I follow the build instructions in your README to build the image, I encounter this problem.


nvidia-smi

nvcc
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

pip list
Package Version


boltons 23.0.0
brotlipy 0.7.0
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 2.0.4
conda 23.5.2
conda-content-trust 0.1.3
conda-libmamba-solver 23.5.0
conda-package-handling 2.1.0
conda_package_streaming 0.8.0
cryptography 41.0.4
filelock 3.12.4
fsspec 2023.9.2
huggingface-hub 0.17.2
idna 3.4
jsonpatch 1.32
jsonpointer 2.1
libmambapy 1.4.1
packaging 23.0
pip 23.1.2
pluggy 1.0.0
pycosat 0.6.4
pycparser 2.21
pyOpenSSL 23.2.0
PySocks 1.7.1
PyYAML 6.0.1
requests 2.29.0
ruamel.yaml 0.17.21
setuptools 67.8.0
six 1.16.0
toolz 0.12.0
tqdm 4.65.0
typing_extensions 4.8.0
urllib3 1.26.16
wheel 0.38.4
zstandard 0.19.0

cmake --version
cmake version 3.16.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).

baichuan2-13b execution error

root@GPU-26:/tensorrtllm_backend/tensorrtllm_backend/tensorrtllm_backend# CUDA_VISIBLE_DEVICES=0,3 python3 ./scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo
root@GPU-26:/tensorrtllm_backend/tensorrtllm_backend/tensorrtllm_backend# I1030 09:08:30.450984 1398 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f3edc000000' with size 268435456
I1030 09:08:30.454704 1398 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1030 09:08:30.454713 1398 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1030 09:08:30.472889 1399 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ff556000000' with size 268435456
I1030 09:08:30.492939 1399 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1030 09:08:30.492958 1399 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1030 09:08:30.796900 1398 model_lifecycle.cc:461] loading: tensorrt_llm:2
I1030 09:08:30.796944 1398 model_lifecycle.cc:461] loading: preprocessing:1
I1030 09:08:30.796963 1398 model_lifecycle.cc:461] loading: postprocessing:1
I1030 09:08:30.815125 1399 model_lifecycle.cc:461] loading: tensorrt_llm:2
I1030 09:08:30.815165 1399 model_lifecycle.cc:461] loading: preprocessing:1
I1030 09:08:30.815184 1399 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] Cannot find parameter with name: batch_scheduler_policy
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1030 09:08:30.880197 1398 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1030 09:08:30.880584 1398 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] Cannot find parameter with name: batch_scheduler_policy
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1030 09:08:30.897152 1399 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1030 09:08:30.897646 1399 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
I1030 09:08:31.500529 1399 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

I1030 09:08:31.503903 1398 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

E1030 09:08:31.669733 1398 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

E1030 09:08:31.669950 1398 model_lifecycle.cc:621] failed to load 'postprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

I1030 09:08:31.669989 1398 model_lifecycle.cc:756] failed to load 'postprocessing'
E1030 09:08:31.686577 1399 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

E1030 09:08:31.686780 1399 model_lifecycle.cc:621] failed to load 'postprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize

I1030 09:08:31.686815 1399 model_lifecycle.cc:756] failed to load 'postprocessing'
I1030 09:08:32.939433 1398 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

I1030 09:08:32.948539 1399 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

E1030 09:08:33.484284 1399 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

E1030 09:08:33.484479 1399 model_lifecycle.cc:621] failed to load 'preprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

I1030 09:08:33.484529 1399 model_lifecycle.cc:756] failed to load 'preprocessing'
E1030 09:08:33.497454 1398 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

E1030 09:08:33.497550 1398 model_lifecycle.cc:621] failed to load 'preprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize

I1030 09:08:33.497571 1398 model_lifecycle.cc:756] failed to load 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 7653 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 7653 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8683, GPU 26856 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8685, GPU 26866 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8683, GPU 56434 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8685, GPU 56444 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +7649, now: CPU 0, GPU 7649 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +7649, now: CPU 0, GPU 7649 (MiB)
E1030 09:08:44.006947 1398 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1030 09:08:44.007090 1398 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 2: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1030 09:08:44.007124 1398 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E1030 09:08:44.007384 1398 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'postprocessing' which has no loaded version. Model 'postprocessing' loading failed with error: version 1 is at UNAVAILABLE state: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
;
I1030 09:08:44.007619 1398 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1030 09:08:44.007837 1398 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

I1030 09:08:44.008106 1398 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize |
| preprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize |
| tensorrt_llm | 2 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and pa |
| | | ged KV cache. |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

E1030 09:08:44.013076 1399 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1030 09:08:44.013151 1399 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 2: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1030 09:08:44.013168 1399 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E1030 09:08:44.013322 1399 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'postprocessing' which has no loaded version. Model 'postprocessing' loading failed with error: version 1 is at UNAVAILABLE state: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.

At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
;
I1030 09:08:44.013473 1399 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1030 09:08:44.013625 1399 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix1_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

I1030 09:08:44.013837 1399 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize |
| preprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize |
| tensorrt_llm | 2 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and pa |
| | | ged KV cache. |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1030 09:08:44.073052 1398 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800 80GB PCIe
I1030 09:08:44.073087 1398 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A800 80GB PCIe
I1030 09:08:44.076025 1398 metrics.cc:710] Collecting CPU metrics
I1030 09:08:44.076206 1398 tritonserver.cc:2458]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

I1030 09:08:44.076217 1398 server.cc:293] Waiting for in-flight requests to complete.
I1030 09:08:44.076221 1398 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1030 09:08:44.076228 1398 server.cc:324] All models are stopped, unloading models
I1030 09:08:44.076234 1398 server.cc:331] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
I1030 09:08:44.077887 1399 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800 80GB PCIe
I1030 09:08:44.077913 1399 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A800 80GB PCIe
I1030 09:08:44.078310 1399 metrics.cc:710] Collecting CPU metrics
I1030 09:08:44.078464 1399 tritonserver.cc:2458]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

I1030 09:08:44.078473 1399 server.cc:293] Waiting for in-flight requests to complete.
I1030 09:08:44.078477 1399 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1030 09:08:44.078484 1399 server.cc:324] All models are stopped, unloading models
I1030 09:08:44.078488 1399 server.cc:331] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[9409,1],0]
Exit code: 1

===========

I use baichuan2-13b.
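
Two separate problems seem to show up in this log. First, the Python pre/postprocessing models fail because AutoTokenizer cannot load Baichuan's custom tokenizer class unless trust_remote_code=True is passed in model.py. Second, the tensorrt_llm model refuses inflight batching because the engine was not built with the GPT attention plugin plus packed input and paged KV cache; rebuilding with --use_gpt_attention_plugin, --remove_input_padding, --paged_kv_cache and --use_inflight_batching (flag names taken from the build commands elsewhere on this page; the Baichuan example may differ slightly), or switching gpt_model_type away from inflight batching, should address it. A minimal sketch of the tokenizer change, assuming the stock model.py loads the tokenizer via transformers:

# Hypothetical tokenizer setup for preprocessing/postprocessing model.py:
# Baichuan ships a custom BaichuanTokenizer implementation, so transformers
# needs trust_remote_code=True to import it.
from transformers import AutoTokenizer

tokenizer_dir = "/path/to/Baichuan2-13B-Chat"  # hypothetical; use your model's tokenizer_dir parameter
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_dir,
    trust_remote_code=True,  # required for custom tokenizer classes such as BaichuanTokenizer
    use_fast=False,
)
print(type(tokenizer).__name__)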

Launch llama triton_server error

launch_triton_server.py ERROR

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo

I1102 08:26:22.668199 1349 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1102 08:26:22.668248 1349 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.15
I1102 08:26:22.668252 1349 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.15
I1102 08:26:22.859278 1349 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f4e2a000000' with size 268435456
I1102 08:26:22.859857 1349 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
E1102 08:26:22.860896 1349 model_repository_manager.cc:1307] Poll failed for model directory 'article_summary': failed to open text file for read /tensorrtllm_backend/triton_model_repo/article_summary/config.pbtxt: No such file or directory
I1102 08:26:22.862390 1349 model_lifecycle.cc:462] loading: tensorrt_llm:1
I1102 08:26:22.862417 1349 model_lifecycle.cc:462] loading: postprocessing:1
I1102 08:26:22.862431 1349 model_lifecycle.cc:462] loading: preprocessing:1
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1102 08:26:22.878460 1349 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1102 08:26:22.878475 1349 python_be.cc:2115] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
postprocess TritonPythonModel.initialize
I1102 08:26:23.679981 1349 model_lifecycle.cc:819] successfully loaded 'postprocessing'
preprocess TritonPythonModel.initialize
I1102 08:26:24.854898 1349 model_lifecycle.cc:819] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 14727 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16005, GPU 15040 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 16006, GPU 15050 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +14726, now: CPU 0, GPU 14726 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16025, GPU 19908 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 16025, GPU 19916 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14726 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 16027, GPU 19924 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 16027, GPU 19934 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 14726 (MiB)
[TensorRT-LLM][INFO] Using 5024 tokens in paged KV cache.
[TensorRT-LLM][WARNING] max_num_sequences is smaller than  2 times the engine max_batch_size. Batches smaller than max_batch_size will be executed.
I1102 08:26:35.980557 1349 model_lifecycle.cc:819] successfully loaded 'tensorrt_llm'
I1102 08:26:35.981003 1349 model_lifecycle.cc:462] loading: ensemble:1
I1102 08:26:35.981210 1349 model_lifecycle.cc:819] successfully loaded 'ensemble'
I1102 08:26:35.981279 1349 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1102 08:26:35.981328 1349 server.cc:631]
+-------------+--------------------------------------------------+--------------------------------------------------+
| Backend     | Path                                             | Config                                           |
+-------------+--------------------------------------------------+--------------------------------------------------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pyt | {}                                               |
|             | orch.so                                          |                                                  |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton | {"cmdline":{"auto-complete-config":"false","back |
|             | _tensorrtllm.so                                  | end-directory":"/opt/tritonserver/backends","min |
|             |                                                  | -compute-capability":"6.000000","default-max-bat |
|             |                                                  | ch-size":"4"}}                                   |
|             |                                                  |                                                  |
| python      | /opt/tritonserver/backends/python/libtriton_pyth | {"cmdline":{"auto-complete-config":"false","back |
|             | on.so                                            | end-directory":"/opt/tritonserver/backends","min |
|             |                                                  | -compute-capability":"6.000000","shm-region-pref |
|             |                                                  | ix-name":"prefix0_","default-max-batch-size":"4" |
|             |                                                  | }}                                               |
|             |                                                  |                                                  |
+-------------+--------------------------------------------------+--------------------------------------------------+

I1102 08:26:35.981354 1349 server.cc:674]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+

I1102 08:26:36.012527 1349 metrics.cc:810] Collecting metrics for GPU 0: NVIDIA A10
I1102 08:26:36.013351 1349 metrics.cc:703] Collecting CPU metrics
I1102 08:26:36.013463 1349 tritonserver.cc:2435]
+----------------------------------+----------------------------------------------------------------------------------+
| Option                           | Value                                                                            |
+----------------------------------+----------------------------------------------------------------------------------+
| server_id                        | triton                                                                           |
| server_version                   | 2.37.0                                                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) sch |
|                                  | edule_policy model_configuration system_shared_memory cuda_shared_memory binary_ |
|                                  | tensor_data parameters statistics trace logging                                  |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo                                           |
| model_control_mode               | MODE_NONE                                                                        |
| strict_model_config              | 1                                                                                |
| rate_limit                       | OFF                                                                              |
| pinned_memory_pool_byte_size     | 268435456                                                                        |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                         |
| min_supported_compute_capability | 6.0                                                                              |
| strict_readiness                 | 1                                                                                |
| exit_timeout                     | 30                                                                               |
| cache_enabled                    | 0                                                                                |
+----------------------------------+----------------------------------------------------------------------------------+

I1102 08:26:36.013470 1349 server.cc:305] Waiting for in-flight requests to complete.
I1102 08:26:36.013476 1349 server.cc:321] Timeout 30: Found 0 model versions that have in-flight inferences
I1102 08:26:36.013693 1349 server.cc:336] All models are stopped, unloading models
I1102 08:26:36.013702 1349 server.cc:343] Timeout 30: Found 4 live models and 0 in-flight non-inference requests
I1102 08:26:36.013781 1349 model_lifecycle.cc:604] successfully unloaded 'ensemble' version 1
I1102 08:26:36.088566 1349 model_lifecycle.cc:604] successfully unloaded 'tensorrt_llm' version 1
I1102 08:26:37.013769 1349 server.cc:343] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
I1102 08:26:37.100363 1349 model_lifecycle.cc:604] successfully unloaded 'postprocessing' version 1
I1102 08:26:37.309030 1349 model_lifecycle.cc:604] successfully unloaded 'preprocessing' version 1
I1102 08:26:38.013851 1349 server.cc:343] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35503,1],0]
  Exit code:    1
--------------------------------------------------------------------------
  • Model Engine Build CMD
    python ../examples/llama/build.py --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 --enable_context_fmha --model_dir=./xx/article_summary/ --rms_norm_eps=1e-05 --max_batch_size=32 --use_inflight_batching --paged_kv_cache

  • triton_model_repo/preprocessing/config.pbtxt

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/article_summary/"
  }
}
  • triton_model_repo/tensorrt_llm/config.pbtxt
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
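
The log above already hints at the actual failure: "Poll failed for model directory 'article_summary': failed to open text file for read .../article_summary/config.pbtxt". All four real models reach READY, but because the article_summary checkpoint/tokenizer directory sits inside triton_model_repo without a config.pbtxt, Triton treats it as a broken model and, since the server exits on load errors here, shuts down with "failed to load all models". Moving that directory outside the model repository and pointing tokenizer_dir at the new location should be enough; a sketch with a hypothetical path:

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/workspace/article_summary/"  # hypothetical location outside triton_model_repo
  }
}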

Failed to load llama in Triton server, parse error

Failed to load the engine in the TensorRT-LLM backend, using config files based on all_models/inflight_batcher_llm. It always reports a "...parse error" like the following. My own config.pbtxt is attached at the end.

Has anyone run into a similar issue, and how did you solve it?

run cmd

# tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo

engine location

$ tree -L 1 tensorrtllm_backend/triton_model_repo/tensorrt_llm/
tensorrtllm_backend/triton_model_repo/tensorrt_llm/
|-- 1
`-- config.pbtxt

Error log

tritonserver --model-repository=/tensorrtllm_backend/triton_model_repo
I1030 07:31:12.979160 4933 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f52d0000000' with size 268435456
I1030 07:31:12.979321 4933 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1030 07:31:12.980735 4933 model_lifecycle.cc:461] loading: preprocessing:1
I1030 07:31:12.980758 4933 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1030 07:31:12.980799 4933 model_lifecycle.cc:461] loading: postprocessing:1
E1030 07:31:13.009883 4933 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
E1030 07:31:13.009933 4933 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
I1030 07:31:13.009940 4933 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I1030 07:31:14.560434 4933 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1030 07:31:14.861963 4933 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1030 07:31:15.621445 4933 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1030 07:31:16.776123 4933 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I1030 07:31:16.776258 4933 server.cc:592]

config.pbtxt

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 128

model_transaction_policy {
decoupled: False
}

input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ 1 ]
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind : KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "1"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm"
}
}
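
For what it's worth, the "[json.exception.parse_error.101] ... unexpected end of input" message is what the backend reports when the config.json next to the engine cannot be read, and the most common cause I have seen is gpt_model_path pointing at the model directory rather than at the version directory that actually holds config.json and the .engine file. Assuming the engine and its config.json were copied into tensorrt_llm/1/ (the tree above only shows one level, so this is an assumption), the parameter would look like:

parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"  # directory that contains config.json
  }
}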

Dockerfile build failed at build_wheel.py: file STRINGS file "/include/NvInferVersion.h" cannot be read.

On Linux with a GPU, run the Dockerfile build with:

cd tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

The build failed at cd tensorrt_llm && python3 scripts/build_wheel.py --trt_root=\"${TRT_ROOT}\" -i -c && cd ..

Rerunning this line runs into: file STRINGS file "/include/NvInferVersion.h" cannot be read.

~/tensorrtllm_backend release/0.5.0 !1 ?2 ❯ cd tensorrt_llm && python3 scripts/build_wheel.py --trt_root=\"${TRT_ROOT}\" -i -c && cd ..
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: build in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.0.3)
Requirement already satisfied: torch in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 2)) (2.0.1)
Requirement already satisfied: transformers==4.31.0 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (4.31.0)
Requirement already satisfied: diffusers==0.15.0 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 4)) (0.15.0)
Requirement already satisfied: accelerate==0.20.3 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 5)) (0.20.3)
Requirement already satisfied: colored in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 6)) (1.4.4)
Requirement already satisfied: polygraphy in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 7)) (0.49.0)
Requirement already satisfied: onnx>=1.12.0 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 8)) (1.14.1)
Requirement already satisfied: mpi4py in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 9)) (3.1.5)
Requirement already satisfied: tensorrt>=8.6.0 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 10)) (8.6.1.post1)
Requirement already satisfied: numpy in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 11)) (1.24.4)
Requirement already satisfied: cuda-python==12.2.0 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 12)) (12.2.0)
Requirement already satisfied: sentencepiece>=0.1.99 in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 13)) (0.1.99)
Requirement already satisfied: wheel in /usr/lib/python3/dist-packages (from -r requirements.txt (line 14)) (0.34.2)
Requirement already satisfied: lark in /home/hayley/.local/lib/python3.8/site-packages (from -r requirements.txt (line 15)) (1.1.7)
Requirement already satisfied: filelock in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (3.12.4)
Requirement already satisfied: huggingface-hub<1.0,>=0.14.1 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (0.17.3)
Requirement already satisfied: packaging>=20.0 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/lib/python3/dist-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (5.3.1)
Requirement already satisfied: regex!=2019.12.17 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (2023.10.3)
Requirement already satisfied: requests in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (2.31.0)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (0.13.3)
Requirement already satisfied: safetensors>=0.3.1 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (0.4.0)
Requirement already satisfied: tqdm>=4.27 in /home/hayley/.local/lib/python3.8/site-packages (from transformers==4.31.0->-r requirements.txt (line 3)) (4.66.1)
Requirement already satisfied: Pillow in /home/hayley/.local/lib/python3.8/site-packages (from diffusers==0.15.0->-r requirements.txt (line 4)) (10.0.1)
Requirement already satisfied: importlib-metadata in /home/hayley/.local/lib/python3.8/site-packages (from diffusers==0.15.0->-r requirements.txt (line 4)) (6.8.0)
Requirement already satisfied: psutil in /usr/lib/python3/dist-packages (from accelerate==0.20.3->-r requirements.txt (line 5)) (5.5.1)
Requirement already satisfied: cython in /home/hayley/.local/lib/python3.8/site-packages (from cuda-python==12.2.0->-r requirements.txt (line 12)) (3.0.2)
Requirement already satisfied: pyproject_hooks in /home/hayley/.local/lib/python3.8/site-packages (from build->-r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: tomli>=1.1.0 in /home/hayley/.local/lib/python3.8/site-packages (from build->-r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: typing-extensions in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (4.8.0)
Requirement already satisfied: sympy in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (1.12)
Requirement already satisfied: networkx in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (3.1)
Requirement already satisfied: jinja2 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (3.1.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.7.99)
Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.7.99)
Requirement already satisfied: nvidia-cuda-cupti-cu11==11.7.101 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.7.101)
Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (8.5.0.96)
Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.10.3.66)
Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (10.9.0.58)
Requirement already satisfied: nvidia-curand-cu11==10.2.10.91 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (10.2.10.91)
Requirement already satisfied: nvidia-cusolver-cu11==11.4.0.1 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.4.0.1)
Requirement already satisfied: nvidia-cusparse-cu11==11.7.4.91 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.7.4.91)
Requirement already satisfied: nvidia-nccl-cu11==2.14.3 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (2.14.3)
Requirement already satisfied: nvidia-nvtx-cu11==11.7.91 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (11.7.91)
Requirement already satisfied: triton==2.0.0 in /home/hayley/.local/lib/python3.8/site-packages (from torch->-r requirements.txt (line 2)) (2.0.0)
Requirement already satisfied: setuptools in /home/hayley/.local/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch->-r requirements.txt (line 2)) (49.4.0)
Requirement already satisfied: cmake in /home/hayley/.local/lib/python3.8/site-packages (from triton==2.0.0->torch->-r requirements.txt (line 2)) (3.27.6)
Requirement already satisfied: lit in /home/hayley/.local/lib/python3.8/site-packages (from triton==2.0.0->torch->-r requirements.txt (line 2)) (17.0.2)
Requirement already satisfied: protobuf>=3.20.2 in /home/hayley/.local/lib/python3.8/site-packages (from onnx>=1.12.0->-r requirements.txt (line 8)) (4.24.4)
Requirement already satisfied: fsspec in /home/hayley/.local/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.14.1->transformers==4.31.0->-r requirements.txt (line 3)) (2023.9.2)
Requirement already satisfied: zipp>=0.5 in /home/hayley/.local/lib/python3.8/site-packages (from importlib-metadata->diffusers==0.15.0->-r requirements.txt (line 4)) (3.17.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/hayley/.local/lib/python3.8/site-packages (from jinja2->torch->-r requirements.txt (line 2)) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/hayley/.local/lib/python3.8/site-packages (from requests->transformers==4.31.0->-r requirements.txt (line 3)) (3.3.0)
Requirement already satisfied: idna<4,>=2.5 in /home/hayley/.local/lib/python3.8/site-packages (from requests->transformers==4.31.0->-r requirements.txt (line 3)) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /home/hayley/.local/lib/python3.8/site-packages (from requests->transformers==4.31.0->-r requirements.txt (line 3)) (2.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /home/hayley/.local/lib/python3.8/site-packages (from requests->transformers==4.31.0->-r requirements.txt (line 3)) (2023.7.22)
Requirement already satisfied: mpmath>=0.19 in /home/hayley/.local/lib/python3.8/site-packages (from sympy->torch->-r requirements.txt (line 2)) (1.3.0)
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
DEPRECATION: mlnx-tools -5.8.0- has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of mlnx-tools or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

[notice] A new release of pip is available: 23.2.1 -> 23.3
[notice] To update, run: pip install --upgrade pip
-- The CXX compiler identification is GNU 9.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- NVTX is disabled
-- Importing batch manager
-- Building PyTorch
-- Building Google tests
-- Building benchmarks
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /usr/local/cuda/bin/nvcc
-- CUDA compiler: /usr/local/cuda/bin/nvcc
-- GPU architectures: 70-real;80-real;86-real;89-real;90-real
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- ========================= Importing and creating target nvinfer ==========================
-- Looking for library nvinfer
-- Library that was found nvinfer_LIB_PATH-NOTFOUND
-- ==========================================================================================
-- ========================= Importing and creating target nvuffparser ==========================
-- Looking for library nvparsers
-- Library that was found nvparsers_LIB_PATH-NOTFOUND
-- ==========================================================================================
-- CUDAToolkit_VERSION 12.2 is greater or equal than 11.0, enable -DENABLE_BF16 flag
-- CUDAToolkit_VERSION 12.2 is greater or equal than 11.8, enable -DENABLE_FP8 flag
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- COMMON_HEADER_DIRS: /home/hayley/tensorrtllm_backend/tensorrt_llm/cpp;/usr/local/cuda/include
-- TORCH_CUDA_ARCH_LIST: 7.0;8.0;8.6;8.9;9.0
-- Found Python3: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter Development Development.Module Development.Embed
-- Found Python executable at /home/hayley/llm_serving/.venv/bin/python3
-- Found Python libraries at /usr/lib/x86_64-linux-gnu
-- Found CUDA: /usr/local/cuda (found version "12.2")
-- Caffe2: CUDA detected: 12.2
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 12.2
CMake Warning at /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:166 (message):
  Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
  /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:281 (find_package)


-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90
CMake Warning at /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:281 (find_package)


-- Found Torch: /home/hayley/llm_serving/.venv/lib/python3.8/site-packages/torch/lib/libtorch.so
-- TORCH_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0
CMake Error at CMakeLists.txt:288 (file):
  file STRINGS file "/include/NvInferVersion.h" cannot be read.


CMake Error at CMakeLists.txt:291 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:293 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:291 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:293 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:291 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:293 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:291 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:293 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:297 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:299 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:297 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:299 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:297 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


CMake Error at CMakeLists.txt:299 (string):
  string sub-command REGEX, mode MATCH needs at least 5 arguments total to
  command.


-- Building for TensorRT version: .., library version:
-- Using MPI_CXX_INCLUDE_DIRS: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi;/usr/lib/x86_64-linux-gnu/openmpi/include
-- Using MPI_CXX_LIBRARIES: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so;/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- USE_CXX11_ABI: False
CMake Error at tensorrt_llm/plugins/CMakeLists.txt:106 (set_target_properties):
  set_target_properties called with incorrect number of arguments.


-- The C compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- ========================= Importing and creating target nvonnxparser ==========================
-- Looking for library nvonnxparser
-- Library that was found nvonnxparser_LIB_PATH-NOTFOUND
-- ==========================================================================================
-- Configuring incomplete, errors occurred!
Traceback (most recent call last):
  File "scripts/build_wheel.py", line 248, in <module>
    main(**vars(args))
  File "scripts/build_wheel.py", line 149, in main
    build_run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'cmake -DCMAKE_BUILD_TYPE="Release" -DBUILD_PYT="ON"  -DTRT_LIB_DIR=""/targets/x86_64-linux-gnu/lib -DTRT_INCLUDE_DIR=""/include -S "/home/hayley/tensorrtllm_backend/tensorrt_llm/cpp"' returned non-zero exit status 1.

Any ideas?
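
Judging from the failing cmake command in the traceback, both -DTRT_LIB_DIR and -DTRT_INCLUDE_DIR are prefixed with an empty string, which is why CMake tries to read "/include/NvInferVersion.h" and the version-parsing steps after it fail. A minimal pre-flight check along these lines (an illustrative sketch, not part of the build scripts; the default path is only an assumption) can confirm that the TensorRT location handed to the build actually contains the expected header and libraries:

# Illustrative pre-flight check (not part of the repository): verify that a
# candidate TensorRT root contains the header and library directory that the
# CMake configure step above failed to locate.
import os
import sys

def check_trt_root(trt_root):
    header = os.path.join(trt_root, "include", "NvInferVersion.h")
    lib_dir = os.path.join(trt_root, "targets", "x86_64-linux-gnu", "lib")
    print("header :", header, "->", "OK" if os.path.isfile(header) else "MISSING")
    print("lib dir:", lib_dir, "->", "OK" if os.path.isdir(lib_dir) else "MISSING")

if __name__ == "__main__":
    # The default path is an assumption; pass your actual TensorRT install location.
    check_trt_root(sys.argv[1] if len(sys.argv) > 1 else "/usr/local/tensorrt")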

Accumulated decoding when streaming

I'm trying to serve a Llama-2-70b-chat-hf model using Triton Inference Server with a TRT-LLM engine. The script I used is tools/inflight_batcher_llm/end_to_end_streaming_client.py:

python3 tools/inflight_batcher_llm/end_to_end_streaming_client.py -p "What is deep learning?" -S -o 64

This script streams the generated tokens as bytes. I changed the callback function so that it prints strings:

print(output[0].decode(), flush=True, end="")

However, the output becomes:

Deeplearningisasubsetofmachinelearningthatinvolvestheuseofartificialneuralnetworkstomodelandsolvecomplexproblems.Inadeeplearningsystem,therearetypicallymultiplelayersofneuralnetworksthatprocessandtransformthedatainahierarchicalmanner.Eachlayerbuildsonthepreviousone,allowingthesystem

We can see that the spaces are gone. This is because the postprocessing model in the ensemble decodes tokens one by one. To get correct spacing, we should call tokenizer.decode(accumulated_tokens) instead of tokenizer.decode(this_token) and output only the delta text from the postprocessing model. However, I have no idea how to maintain that state in the postprocessing model, since all models in the ensemble form a single stateless forward function.

One solution I can think of is removing the postprocessing model from the ensemble and letting the client decode the tokens itself. However, this is not ideal because it requires the client to know and load the tokenizer of the model it talks to.
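
For reference, here is a minimal client-side sketch of the accumulated-decoding idea described above, assuming a Hugging Face tokenizer for the same model and a callback that receives one token ID per streamed response (the callback name and wiring are illustrative only, not part of the backend):

# Sketch of delta decoding on the client: keep every token ID seen so far,
# decode the whole accumulated sequence each time, and print only the new
# suffix so that spacing is preserved. Assumes a Hugging Face tokenizer and
# access to the (gated) Llama-2 checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []
printed_text = ""

def on_stream_token(token_id):
    """Illustrative callback: called once per streamed token ID."""
    global printed_text
    accumulated_ids.append(token_id)
    text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    delta = text[len(printed_text):]
    printed_text = text
    print(delta, flush=True, end="")

The same bookkeeping is what the postprocessing model would need to do server-side, which is exactly the per-request state the ensemble cannot currently hold.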

Update README to include steps for using the TRT LLM Triton Backend container.

For example, does this step need to be done in the tritonserver:23.10-trtllm-python-py3 container?

# TensorRT-LLM is required for generating engines. You can skip this step if
# you already have the package installed. If you are generating engines within
# the Triton container, you have to install the TRT-LLM package.
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
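
As a quick sanity check (not part of the README snippet above), one can confirm from inside the container that the package installed by these steps is importable:

# Illustrative check that the TensorRT-LLM Python package installed above is
# importable inside the container; __version__ is assumed to be exposed.
import tensorrt_llm

print("tensorrt_llm version:", tensorrt_llm.__version__)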

Why is outputTokensPerSecond much smaller than with FasterTransformer?

I have used Triton Server + FasterTransformer (FT) in the past, and now I use Triton Server + TensorRT-LLM with in-flight batching, but there is a big gap in outputTokensPerSecond between them.

max_new_tokens: 256
tp: 4
model: codeLlama-7b

The Triton Server image used for TensorRT-LLM is:

nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

The TensorRT-LLM configuration is taken from:

https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0/all_models/inflight_batcher_llm

The command used to build the TensorRT engine(s) from the HF checkpoint is:

python build.py --model_dir ./META-CodeLlama-7b-hf/  \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /tensorrtllm_backend/trt_llama_7b_fp16/4-gpu/  \
                --world_size 4 \
                --tp_size 4

and the outputTokensPerSecond is as follows:

FT: 1800
TensorRT-LLM: 470

This difference is surprisingly large, and I don't know what the problem is.

Param "stop_words" not respected in v2/models/ensemble/generate endpoint

Hi, it doesn't seem like "stop_words" is respected in the generate endpoint.

I'm getting the same output with and without this field:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "branch"}'
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"<s> What is machine learning? Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine"}

I wasn't sure whether I should supply a list, so I tried that as well:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ["branch"]}'
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"<s> What is machine learning? Machine learning is a branch of artificial intelligence that allows computers to learn without being explicitly programmed. Machine"}

[0.5.0][Bug] Release build failure

The library built from the official container is broken:

ldd -r libtriton_tensorrtllm.so
	linux-vdso.so.1 (0x00007ffc3d582000)
	libtritonserver.so => /app/inflight_batcher_llm/build/_deps/repo-core-build/libtritonserver.so (0x00007f5237d62000)
	libmpi.so.40 => /opt/hpcx/ompi/lib/libmpi.so.40 (0x00007f5237c43000)
	libnvinfer.so.9 => /usr/local/tensorrt/lib/libnvinfer.so.9 (0x00007f52282ce000)
	libnvinfer_plugin_tensorrt_llm.so.9 => /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so.9 (0x00007f51f1463000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f51f122d000)
	libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f51f120d000)
	libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x00007f51f0fe5000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f523d51c000)
	libopen-rte.so.40 => /opt/hpcx/ompi/lib/libopen-rte.so.40 (0x00007f51f0f26000)
	libopen-pal.so.40 => /opt/hpcx/ompi/lib/libopen-pal.so.40 (0x00007f51f0e0d000)
	libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x00007f51f0d26000)
	libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f51f0d21000)
	libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x00007f51f0d1c000)
	librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x00007f51f0d17000)
	libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f51ea400000)
	libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f51c7400000)
	libcuda.so.1 => /usr/local/cuda/compat/lib.real/libcuda.so.1 (0x00007f51c57c1000)
	libnccl.so.2 => /usr/lib/x86_64-linux-gnu/libnccl.so.2 (0x00007f51b4b65000)
	libz.so.1 => /usr/lib/x86_64-linux-gnu/libz.so.1 (0x00007f51f0cf9000)
undefined symbol: cudaMemcpyAsync	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMallocAsync	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamDestroy	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamWaitEvent	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphExecDestroy	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGetDeviceCount	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemsetAsync	(./libtriton_tensorrtllm.so)
undefined symbol: cudaLaunchKernel	(./libtriton_tensorrtllm.so)
undefined symbol: cudaFree	(./libtriton_tensorrtllm.so)
undefined symbol: cudaEventSynchronize	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGetDeviceProperties_v2	(./libtriton_tensorrtllm.so)
undefined symbol: cudaPeekAtLastError	(./libtriton_tensorrtllm.so)
undefined symbol: cudaFuncSetAttribute	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphExecUpdate	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGetDevice	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemset	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphUpload	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaRegisterFunction	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamEndCapture	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceEnablePeerAccess	(./libtriton_tensorrtllm.so)
undefined symbol: cudaEventRecord	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceGetAttribute	(./libtriton_tensorrtllm.so)
undefined symbol: cudaFuncGetAttributes	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphLaunch	(./libtriton_tensorrtllm.so)
undefined symbol: cudaFreeAsync	(./libtriton_tensorrtllm.so)
undefined symbol: cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags	(./libtriton_tensorrtllm.so)
undefined symbol: cudaIpcCloseMemHandle	(./libtriton_tensorrtllm.so)
undefined symbol: cudaFreeHost	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaRegisterFatBinaryEnd	(./libtriton_tensorrtllm.so)
undefined symbol: cudaIpcOpenMemHandle	(./libtriton_tensorrtllm.so)
undefined symbol: cudaEventDestroy	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphDestroy	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceCanAccessPeer	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemPoolSetAttribute	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceGetDefaultMemPool	(./libtriton_tensorrtllm.so)
undefined symbol: cudaEventCreateWithFlags	(./libtriton_tensorrtllm.so)
undefined symbol: cudaSetDevice	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemGetInfo	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceSynchronize	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaUnregisterFatBinary	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamSynchronize	(./libtriton_tensorrtllm.so)
undefined symbol: cudaPointerGetAttributes	(./libtriton_tensorrtllm.so)
undefined symbol: cudaHostAlloc	(./libtriton_tensorrtllm.so)
undefined symbol: cudaIpcGetMemHandle	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaRegisterFatBinary	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamCreateWithPriority	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaPopCallConfiguration	(./libtriton_tensorrtllm.so)
undefined symbol: cudaStreamBeginCapture	(./libtriton_tensorrtllm.so)
undefined symbol: cudaDeviceDisablePeerAccess	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMalloc	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemcpy	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGraphInstantiate	(./libtriton_tensorrtllm.so)
undefined symbol: cudaMemPoolSetAccess	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaRegisterVar	(./libtriton_tensorrtllm.so)
undefined symbol: __cudaPushCallConfiguration	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGetLastError	(./libtriton_tensorrtllm.so)
undefined symbol: cudaGetErrorString	(./libtriton_tensorrtllm.so)

Steps to reproduce

TRT_LLM_VERSION=release/0.5.0

pushd /tmp

git clone https://github.com/NVIDIA/TensorRT-LLM.git -b ${TRT_LLM_VERSION} --recursive
git clone https://github.com/triton-inference-server/tensorrtllm_backend -b ${TRT_LLM_VERSION}

rm -rf tensorrtllm_backend/tensorrt_llm
mv TensorRT-LLM tensorrtllm_backend/tensorrt_llm
cd tensorrtllm_backend/tensorrt_llm

pushd /tmp/tensorrtllm_backend/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu
rm -rf *
wget https://github.com/NVIDIA/TensorRT-LLM/raw/release/0.5.0/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a
wget https://github.com/NVIDIA/TensorRT-LLM/raw/release/0.5.0/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.pre_cxx11.a
popd

cd tensorrtllm_backend
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
popd
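
For what it's worth, here is a small helper (an illustrative sketch, not part of the repository) that runs the same ldd -r check as above and lists the unresolved symbols, flagging the CUDA runtime ones, which makes it easy to confirm that the CUDA runtime library is not among the backend's linked dependencies:

# Illustrative helper: run `ldd -r` on the built backend library and report the
# symbols it cannot resolve, highlighting the CUDA runtime ones.
import subprocess
import sys

def unresolved_symbols(lib_path):
    proc = subprocess.run(["ldd", "-r", lib_path],
                          capture_output=True, text=True, check=False)
    names = []
    for line in (proc.stdout + proc.stderr).splitlines():
        if "undefined symbol:" in line:
            # e.g. "undefined symbol: cudaMemcpyAsync   (./libtriton_tensorrtllm.so)"
            names.append(line.split("undefined symbol:", 1)[1].split()[0])
    return names

if __name__ == "__main__":
    lib = sys.argv[1] if len(sys.argv) > 1 else "./libtriton_tensorrtllm.so"
    missing = unresolved_symbols(lib)
    cuda_missing = [s for s in missing if s.lstrip("_").startswith("cuda")]
    print(f"{len(cuda_missing)} of {len(missing)} unresolved symbols are CUDA runtime symbols")
    for name in cuda_missing:
        print(" ", name)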

TensorRT Model metrics not showing in triton metrics endpoint

I am running the "in-flight-batching" example with a llama13B-chat based model compiled with tp=2

The triton-inference-server tensorrt_llm backend was compiled following the instructions for version 0.5.0.

Inference works, but I am finding that all metric values on the /metrics endpoint for the "tensorrt_llm" model are 0, even though the "preprocessing" and "postprocessing" values are incrementing and the processed outputs are correct.

Is there something I need to do to enable these metrics to be tracked by Triton?
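
A quick way to inspect this programmatically (an illustrative sketch; it assumes Triton's default metrics port 8002 and the requests library) is to fetch the /metrics endpoint and print only the lines that mention the tensorrt_llm model:

# Fetches the Prometheus-format metrics and prints the entries for the
# tensorrt_llm model, so it is easy to see which counters stay at 0.
import requests

metrics = requests.get("http://localhost:8002/metrics", timeout=10).text
for line in metrics.splitlines():
    if "tensorrt_llm" in line and not line.startswith("#"):
        print(line)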

How can I find the documentation for the parameters in config.pbtxt and the relationships between them?

When I want to know the meaning of a parameter in config.pbtxt, I do not know where to find its description.

For example, I wanted to enable in-flight batching, but I could not find any documentation about it until I came across https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/inflight_batcher_llm/README.md.

This is quite inconvenient, especially for new features. It would be nice if there were a dedicated place documenting these parameters.

Thanks.

Problems when deploying Triton Server according to the README

I have completed the image build according to option 3. Next, I need to prepare the TensorRT-LLM engines according to the README. Do I need to execute 'git submodule update --init --recursive' inside the container or outside the container?

Currently, I am executing the 'prepare' steps above inside a container. When I run 'python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16', the following error occurs:

(error screenshot not reproduced here)

In addition, I did not run 'git submodule update --init --recursive' in the container because of the following error:

/app# git submodule update --init --recursive
fatal: not a git repository (or any of the parent directories): .git
