root@GPU-26:/tensorrtllm_backend/tensorrtllm_backend/tensorrtllm_backend# CUDA_VISIBLE_DEVICES=0,3 python3 ./scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo
I1030 09:08:30.450984 1398 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f3edc000000' with size 268435456
I1030 09:08:30.454704 1398 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1030 09:08:30.454713 1398 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1030 09:08:30.472889 1399 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ff556000000' with size 268435456
I1030 09:08:30.492939 1399 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1030 09:08:30.492958 1399 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I1030 09:08:30.796900 1398 model_lifecycle.cc:461] loading: tensorrt_llm:2
I1030 09:08:30.796944 1398 model_lifecycle.cc:461] loading: preprocessing:1
I1030 09:08:30.796963 1398 model_lifecycle.cc:461] loading: postprocessing:1
I1030 09:08:30.815125 1399 model_lifecycle.cc:461] loading: tensorrt_llm:2
I1030 09:08:30.815165 1399 model_lifecycle.cc:461] loading: preprocessing:1
I1030 09:08:30.815184 1399 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] Cannot find parameter with name: batch_scheduler_policy
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1030 09:08:30.880197 1398 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1030 09:08:30.880584 1398 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] Cannot find parameter with name: batch_scheduler_policy
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1030 09:08:30.897152 1399 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1030 09:08:30.897646 1399 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
I1030 09:08:31.500529 1399 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
I1030 09:08:31.503903 1398 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
E1030 09:08:31.669733 1398 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
E1030 09:08:31.669950 1398 model_lifecycle.cc:621] failed to load 'postprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
I1030 09:08:31.669989 1398 model_lifecycle.cc:756] failed to load 'postprocessing'
E1030 09:08:31.686577 1399 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
E1030 09:08:31.686780 1399 model_lifecycle.cc:621] failed to load 'postprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
I1030 09:08:31.686815 1399 model_lifecycle.cc:756] failed to load 'postprocessing'
I1030 09:08:32.939433 1398 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
I1030 09:08:32.948539 1399 pb_stub.cc:325] Failed to initialize Python stub: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
E1030 09:08:33.484284 1399 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
E1030 09:08:33.484479 1399 model_lifecycle.cc:621] failed to load 'preprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
I1030 09:08:33.484529 1399 model_lifecycle.cc:756] failed to load 'preprocessing'
E1030 09:08:33.497454 1398 backend_model.cc:634] ERROR: Failed to create instance: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
E1030 09:08:33.497550 1398 model_lifecycle.cc:621] failed to load 'preprocessing' version 1: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize
I1030 09:08:33.497571 1398 model_lifecycle.cc:756] failed to load 'preprocessing'
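Both the preprocessing and postprocessing failures above share one root cause: BaichuanTokenizer is custom tokenizer code shipped inside the Baichuan2 checkpoint (tokenization_baichuan.py), not a class bundled with transformers, so AutoTokenizer can only resolve it when remote code is allowed. A minimal sketch of the likely fix at the from_pretrained call in preprocessing/1/model.py (line 69) and postprocessing/1/model.py (line 65), assuming both load the tokenizer via AutoTokenizer; tokenizer_dir is a stand-in for whatever path your config.pbtxt passes:

    from transformers import AutoTokenizer

    # trust_remote_code=True lets AutoTokenizer import BaichuanTokenizer
    # from the tokenization_baichuan.py that ships with the checkpoint,
    # instead of searching transformers' built-in tokenizer registry.
    tokenizer = AutoTokenizer.from_pretrained(
        tokenizer_dir,            # hypothetical variable: path to the Baichuan2-13B tokenizer files
        trust_remote_code=True,
        padding_side='left')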
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 7653 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 7653 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8683, GPU 26856 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8685, GPU 26866 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 8683, GPU 56434 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 8685, GPU 56444 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +7649, now: CPU 0, GPU 7649 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +7649, now: CPU 0, GPU 7649 (MiB)
E1030 09:08:44.006947 1398 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1030 09:08:44.007090 1398 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 2: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1030 09:08:44.007124 1398 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
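This failure is independent of the tokenizer problem: the tensorrt_llm model is configured for in-flight batching (TrtGptModelInflightBatching), which requires an engine built with the GPT attention plugin, packed input (padding removed), and a paged KV cache, and the engine being loaded here has none of them. One fix is rebuilding the engine with those features enabled. A sketch using the flag names from TensorRT-LLM's examples/baichuan/build.py of this release (paths are placeholders, and exact flag names can differ between versions):

    python3 build.py --model_version v2_13b \
        --model_dir /path/to/Baichuan2-13B-Chat \
        --dtype float16 \
        --use_gpt_attention_plugin float16 \
        --use_gemm_plugin float16 \
        --remove_input_padding \
        --paged_kv_cache \
        --world_size 2 \
        --output_dir /path/to/trt_engines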
E1030 09:08:44.007384 1398 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'postprocessing' which has no loaded version. Model 'postprocessing' loading failed with error: version 1 is at UNAVAILABLE state: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
;
I1030 09:08:44.007619 1398 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1030 09:08:44.007837 1398 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
I1030 09:08:44.008106 1398 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize |
| preprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize |
| tensorrt_llm | 2 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and pa |
| | | ged KV cache. |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
E1030 09:08:44.013076 1399 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
E1030 09:08:44.013151 1399 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 2: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and paged KV cache.
I1030 09:08:44.013168 1399 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
E1030 09:08:44.013322 1399 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'postprocessing' which has no loaded version. Model 'postprocessing' loading failed with error: version 1 is at UNAVAILABLE state: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported.
At:
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained
/tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize
;
I1030 09:08:44.013473 1399 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1030 09:08:44.013625 1399 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","shm-region-prefix-name":"prefix1_","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
I1030 09:08:44.013837 1399 server.cc:662]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/postprocessing/1/model.py(65): initialize |
| preprocessing | 1 | UNAVAILABLE: Internal: ValueError: Tokenizer class BaichuanTokenizer does not exist or is not currently imported. |
| | | |
| | | At: |
| | | /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(748): from_pretrained |
| | | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(69): initialize |
| tensorrt_llm | 2 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: TrtGptModelInflightBatching requires GPT attention plugin with packed input and pa |
| | | ged KV cache. |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
I1030 09:08:44.073052 1398 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800 80GB PCIe
I1030 09:08:44.073087 1398 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A800 80GB PCIe
I1030 09:08:44.076025 1398 metrics.cc:710] Collecting CPU metrics
I1030 09:08:44.076206 1398 tritonserver.cc:2458]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
I1030 09:08:44.076217 1398 server.cc:293] Waiting for in-flight requests to complete.
I1030 09:08:44.076221 1398 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1030 09:08:44.076228 1398 server.cc:324] All models are stopped, unloading models
I1030 09:08:44.076234 1398 server.cc:331] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
I1030 09:08:44.077887 1399 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800 80GB PCIe
I1030 09:08:44.077913 1399 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A800 80GB PCIe
I1030 09:08:44.078310 1399 metrics.cc:710] Collecting CPU metrics
I1030 09:08:44.078464 1399 tritonserver.cc:2458]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /tensorrtllm_backend/tensorrtllm_backend/triton_model_repo |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
I1030 09:08:44.078473 1399 server.cc:293] Waiting for in-flight requests to complete.
I1030 09:08:44.078477 1399 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1030 09:08:44.078484 1399 server.cc:324] All models are stopped, unloading models
I1030 09:08:44.078488 1399 server.cc:331] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[9409,1],0]
Exit code: 1
===========
I'm using Baichuan2-13B.
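If rebuilding the engine is not an option, the other direction is to stop requesting in-flight batching: in triton_model_repo/tensorrt_llm/config.pbtxt, set gpt_model_type to V1 so the backend falls back to the static batching path the existing engine supports. A sketch of the relevant parameter block (note that V1 gives up in-flight batching throughput):

    parameters: {
      key: "gpt_model_type"
      value: {
        string_value: "V1"
      }
    }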