triton-inference-server / client

Triton Python, C++, and Java client libraries, and gRPC-generated client examples for Go, Java, and Scala.

License: BSD 3-Clause "New" or "Revised" License

CMake 2.78% C++ 65.21% C 0.22% Shell 0.25% Go 0.26% Java 2.76% Scala 0.24% Python 28.14% JavaScript 0.15%


client's Issues

input_data

Hi,

Do I need to create a JSON file with sample data for every input defined across all of my config.pbtxt files?

For example, if I have two models and each model has two inputs, should input_data.json contain sample input data for all four inputs?

Thanks.
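
Assuming this refers to perf_analyzer's --input-data JSON (a guess from the question), perf_analyzer is run against one model at a time, so the file would presumably only need entries for that model's inputs. Below is a minimal, hypothetical sketch that writes such a file for a two-input model; the input names INPUT0/INPUT1 and the values are placeholders.

# Sketch: input-data file for a single model with two inputs, written from Python.
# Assumes perf_analyzer's --input-data JSON layout; names and values are placeholders.
import json

input_data = {
    "data": [
        {
            "INPUT0": [1.0, 2.0, 3.0, 4.0],     # flattened sample values for the first input
            "INPUT1": [10.0, 20.0, 30.0, 40.0]  # flattened sample values for the second input
        }
    ]
}

with open("input_data.json", "w") as f:
    json.dump(input_data, f, indent=2)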

Revert urllib3 version pin

Hi. We are trying to integrate with OIP model servers over at the Feast feature store and need to add mlserver and tritonclient as optional dependencies. The problem is that we also already depend on snowflake-connector-python, which still has a strict urllib3<2.0.0 requirement for Python 3.9. I saw that the urllib3 version pin here was only added to avoid vulnerability reports (#457). I'm not sure which vulnerability that was referring to, but it seems urllib3 plans to keep shipping security fixes for v1. Is it possible to revert the version pin? Or maybe allow something like (>=1.26.18,<2 or >=2.0.7). Thanks

DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.

Description
test_cuda_shared_memory.py fails when the batch dimension is smaller than 2. I think this issue comes from PyTorch, but I'm wondering whether there is any workaround to make it work with PyTorch versions > 1.12.

Workaround
Use torch 1.12. I tested it and it works fine.

Triton Information
What version of Triton client are you using?
Compiled from the latest version, b0b5b27.

To Reproduce

import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1,2,4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()

        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)

        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])

        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32",  tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))

        cudashm.destroy_shared_memory_region(shm_handle)
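
A possible workaround sketch (not verified against the failing torch versions): hand the shared-memory helper a flattened view of the tensor. A 1-D view has trivially C-contiguous strides, and since set_shared_memory_region_from_dlpack only copies the underlying bytes, the region contents should be unchanged.

# Sketch: flatten the tensor before exporting it via DLPack to sidestep the
# contiguity check. Whether this helps on torch > 1.12 is an assumption.
import torch
import tritonclient.utils.cuda_shared_memory as cudashm

gpu_tensor = torch.ones((1, 2, 4)).cuda(0)
byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()

shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor.reshape(-1)])
cudashm.destroy_shared_memory_region(shm_handle)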

Performance Analyzer cannot collect metrics on Jetson Xavier

I have deployed Triton on a Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are measured correctly, but when I try to collect metrics using the --collect-metrics option, the following messages appear:

WARNING: Unable to parse 'nv_gpu_utilization' metric.
WARNING: Unable to parse 'nv_gpu_power_usage' metric.
WARNING: Unable to parse 'nv_gpu_memory_used_bytes' metric.
WARNING: Unable to parse 'nv_gpu_memory_total_bytes' metric.

The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1

Is it possible to solve this problem?

Thanks.
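
As a quick check (a sketch, assuming the server's metrics endpoint is reachable on the default port 8002), you can dump whatever nv_gpu_* gauges the server actually publishes; if the Jetson build does not expose GPU metrics at all, that would explain the parse warnings.

# Sketch: list the GPU gauges exposed by the Triton metrics endpoint.
# Assumes the default metrics port 8002; adjust the URL for your deployment.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    metrics = resp.read().decode("utf-8")

gpu_lines = [line for line in metrics.splitlines() if "nv_gpu" in line]
print("\n".join(gpu_lines) if gpu_lines else "No nv_gpu_* metrics reported by the server.")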

make cc-clients: Could not find requested file: RapidJSON-targets.cmake

The CMake configure step is not successful.

❯ cmake --version
cmake version 3.21.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Commands used:

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients


...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2

How do I get genai-perf to analyze my defined data set

My analysis command

genai-perf \
  -m codex-py-mr \
  --service-kind triton \
  --backend tensorrtllm  \
  --tokenizer /mnt/models/source/tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --artifact-dir artifacts2 \
  --profile-export-file my_profile_export.json \
  --url 0.0.0.0:9001 \
  --input-file /data/script/llm_inputs.json \
  --generate-plots

The /data/script/llm_inputs.json file contains only one string input.
I found that when I specify --input-file, genai-perf wraps the contents of llm_inputs.json into the following format:

{
  "data": [
    {
      "text_input": [
        "${question}"
      ],
      "max_tokens": [
        256
      ],
      "model": "codex-py-mr"
    }
  ]
}

My questions

  1. Why can --input-file only define a single prompt? Can I define multiple prompts?

  2. Can the model input key that genai-perf wraps the prompt in only be text_input? Can I customize the name of this key?

  3. Can --input-dataset only use HuggingFace datasets, or can I pass in a custom dataset?
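
For what it's worth, based purely on the wrapped format shown above (not on documented genai-perf behavior), multiple prompts would presumably just be additional entries in the data array. A hedged sketch that generates such a file:

# Hypothetical sketch: build a multi-prompt input file mirroring the wrapped
# format shown above. Whether genai-perf accepts multiple entries this way is
# an assumption, not confirmed behavior.
import json

prompts = [
    "Write a Python function that reverses a string.",
    "Summarize the plot of Hamlet in two sentences.",
]

payload = {
    "data": [
        {"text_input": [p], "max_tokens": [256], "model": "codex-py-mr"}
        for p in prompts
    ]
}

with open("llm_inputs.json", "w") as f:
    json.dump(payload, f, indent=2)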

Incomplete installation of all genai-perf dependencies prevents it from being run on air-gapped servers

When genai-perf is installed using pip from GitHub (as documented), on first run it tries to download several files from Hugging Face, like this:

$ docker run --rm -it --name test -u 0 gpu-tritonserver-tst:latest bash -c "genai-perf --help"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [00:00<00:00, 5.45MB/s]
tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 78.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.25MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 2.72MB/s]
usage: genai-perf [-h] [--expected-output-tokens EXPECTED_OUTPUT_TOKENS] [--input-type {url,file,synthetic}] [--input-tokens-mean INPUT_TOKENS_MEAN] [--input-tokens-stddev INPUT_TOKENS_STDDEV] -m MODEL
                  [--num-of-output-prompts NUM_OF_OUTPUT_PROMPTS] [--output-format {openai_chat_completions,openai_completions,trtllm,vllm}] [--random-seed RANDOM_SEED] [--concurrency CONCURRENCY]
                  [--input-data INPUT_DATA] [-p MEASUREMENT_INTERVAL] [--profile-export-file PROFILE_EXPORT_FILE] [--request-rate REQUEST_RATE] [--service-kind {triton,openai}] [-s STABILITY_PERCENTAGE]
                  [--streaming] [-v] [--version] [--endpoint ENDPOINT] [-u URL] [--dataset {openorca,cnn_dailymail}]

CLI to profile LLMs and Generative AI models with Perf Analyzer
[..]

This "calling-home" behavior prevents genai-perf from running correctly in corporate air-gapped environments. All required dependencies need to be collected by Python or Bash scripts at install time (which can occur on a different server, such as an internet-connected build server where most pull actions are permitted) rather than being pulled on the first run of the program.
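
Until that happens, one possible workaround (a sketch, not a supported installation path) is to pre-fetch the tokenizer files on an internet-connected build machine, copy them to the air-gapped host, point genai-perf at the local copy via --tokenizer, and force the Hugging Face libraries offline. The repo id below is a placeholder, not necessarily what genai-perf pulls by default.

# Sketch: stage tokenizer files on a connected machine, then run offline.
# repo_id and local_dir are placeholders; requires the huggingface_hub package.
import os
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="hf-internal-testing/llama-tokenizer",  # placeholder tokenizer repo
    local_dir="/opt/tokenizers/llama",
)
print(f"Tokenizer files staged in {local_dir}")

# On the air-gapped host, keep transformers/huggingface_hub from reaching out:
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"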

Memory leak from grpcio

I tested tritonclient 2.43.0 on Ubuntu 22.04 with grpcio 1.62.1 and ran into a memory leak. Example for reproduction:

import asyncio
from tritonclient.grpc.aio import InferenceServerClient

async def get_triton_client():
    return InferenceServerClient(url='127.0.0.1:8002', verbose=True)

if __name__ == "__main__":
    while True:
        print(asyncio.run(get_triton_client()))

The problem also reproduces on the latest versions, tritonclient 2.47.0 and grpcio 1.64.1.
I have found that pinning grpcio to 1.58.0 avoids it.

As I understand it, tritonclient already warns against using some versions of grpcio:

# Check grpc version and issue warnings if grpc version is known to have

Could you confirm this?
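
Not a confirmed fix, but worth noting that the reproduction never closes the clients it creates. A sketch of the same loop that explicitly closes each client's channel, in case the leak is tied to unclosed grpc.aio channels (an assumption):

# Sketch: same loop, but each client is closed before the next iteration.
import asyncio
from tritonclient.grpc.aio import InferenceServerClient

async def probe_once():
    client = InferenceServerClient(url='127.0.0.1:8002', verbose=True)
    try:
        print(client)
    finally:
        await client.close()  # release the underlying grpc.aio channel

if __name__ == "__main__":
    while True:
        asyncio.run(probe_once())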

AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

I have a Triton server running in Docker, serving a CLIP model. I wrote some simple code to run inference and get the output of this model, but I get the error AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'.

Here is my client code:

from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient

if __name__ == "__main__":

    triton_client = httpclient.InferenceServerClient(url="localhost:8003")

    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')

    inputs = processor(images=image, return_tensors="pt")['pixel_values']

    inputs = []
    outputs = []

    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)

    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))

    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )

    print(results.as_numpy("output__0"))

Here's the error I'm getting

Traceback (most recent call last):
  File "main.py", line 18, in <module>
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Help me please
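
For reference, a hedged sketch of what the client code might look like once InferInput and InferRequestedOutput are taken from the tritonclient.http module (rather than from the client object), the processor output is converted to a NumPy array, and the datatype is spelled "FP32". The input/output names, model name, and image path are carried over from the snippet above:

# Sketch: InferInput/InferRequestedOutput live on the tritonclient.http module,
# not on InferenceServerClient, and expect NumPy data with Triton datatype "FP32".
import numpy as np
import tritonclient.http as httpclient
from PIL import Image
from transformers import CLIPProcessor

triton_client = httpclient.InferenceServerClient(url="localhost:8003")

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("9997103.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="np")["pixel_values"].astype(np.float32)

infer_input = httpclient.InferInput("input__0", list(pixel_values.shape), "FP32")
infer_input.set_data_from_numpy(pixel_values)

requested_output = httpclient.InferRequestedOutput("output__0", binary_data=False)

results = triton_client.infer(
    model_name="clip",
    inputs=[infer_input],
    outputs=[requested_output],
)
print(results.as_numpy("output__0"))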

Make perf_analyzer work on macbook

It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?

Any example of triton-vllm in c++?

I use tc::InferenceServerGrpcClient::Infer but get an error saying that it does not support models with a decoupled transaction policy.
When I use tc::InferenceServerGrpcClient::AsyncInfer, there is no error message but no request reaches the server, and the server log shows the error: Infer failed: ModelInfer RPC doesn't support models with decoupled transaction policy
Which API should I use?

Can not import GRPC Tritonclient in Seldon MLServer

Description
I need to use tritonclient within Seldon MLServer, but when I import the library in the server, I get this error:

TypeError: Couldn't build proto file into descriptor pool: duplicate symbol 'inference.ServerLiveRequest'

My best guess is that both MLServer and tritonclient import the same proto file (following the KServe API v2 spec), which causes a symbol name collision.

To Reproduce

  • import tritonclient.grpc
  • start MLServer

Expected behavior
Being able to import tritonclient alongside MLServer. Perhaps namespacing the generated protos to prevent symbol collisions among KServe API clients would do the trick; I am not a gRPC expert, but I expect some mechanism like that exists.

Add support for FetchContent or find_package

Neither find_package() nor FetchContent works out of the box for a standalone C++ CMake app.

find_package

Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder, or alternatively install tritonclient globally. In either case, the following minimal CMake configuration fails:

# tritonclient
find_package(TritonCommon REQUIRED)
find_package(TritonClient REQUIRED)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:43 (add_executable):
  Target "test" links to target "protobuf::libprotobuf" but the target was
  not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

FetchContent

# tritonclient
FetchContent_Declare(
    tritonclient
    GIT_REPOSITORY https://github.com/triton-inference-server/client
    GIT_TAG r24.04
)
set(TRITON_ENABLE_CC_GRPC ON)
set(TRITON_COMMON_REPO_TAG r24.04)
set(TRITON_THIRD_PARTY_REPO_TAG r24.04)
set(TRITON_CORE_REPO_TAG r24.04)
FetchContent_MakeAvailable(tritonclient)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:55 (add_executable):
  Target "test" links to target "TritonClient::grpcclient" but the target
  was not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

What is currently the recommended way to create a standalone app that depends on tritonclient? It seems that with find_package the link targets are broken, and FetchContent is not supported at all because no targets are exported.

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unable to open shared memory region: '/output_simple'

I ran the src/python/examples/simple_grpc_shm_client.py file, but got the error below:
Traceback (most recent call last):
  File "demo.py", line 95, in <module>
    triton_client.register_system_shared_memory(
  File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_client.py", line 1223, in register_system_shared_memory
    raise_error_grpc(rpc_error)
  File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unable to open shared memory region: '/output_simple'
How can I fix this?
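
One thing worth checking (a sketch of a possible cleanup, assuming the failure is caused by a region left over from a previous run rather than a server-side problem): unregister any stale system shared-memory regions on the server and remove the leftover segment under /dev/shm before re-running the example.

# Sketch: clear stale shared-memory state before re-running the example.
# The segment names below are assumptions based on the error message.
import os
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Ask the server to forget all registered system shared-memory regions.
client.unregister_system_shared_memory()

# Remove leftover segments on the host, if they still exist.
for name in ("output_simple", "input_simple"):
    try:
        os.remove(f"/dev/shm/{name}")
    except FileNotFoundError:
        pass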

urllib dependency is present when using [grpc, cuda] options

From my reading of the code, urllib3 is only used by the HTTP client, yet the top-level requirements.txt lists it as a dependency. It should only be in requirements_http.txt, not in the top-level requirements.txt.

This creates unnecessary conflicts when one only wants to depend on the gRPC client and not the HTTP one. Dependency conflicts like the one reported here: #648

Memory leak in SharedMemoryTensor.__dlpack__

Hello, a memory leak occurs when executing the code below. The code was run with Python 3.10, tritonclient 2.41.1, and torch 2.1.2.

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)

The leak occurs when the __dlpack__ method is invoked by torch.from_dlpack(smt).

For the Go gRPC client, when calling the ModelInfer interface, how do I parse useful values from ModelInferResponse?

ModelInferResponse is defined as follows:

type ModelInferResponse struct {
	// other fields ...

	Outputs           []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
	RawOutputContents [][]byte                                `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}

After calling the ModelInfer interface to get the response, how do I parse the contents of the RawOutputContents field according to the output's Datatype?
Also, why is Outputs[i].Contents empty?

[QUESTION] - Tensorflow python infer to triton client

Newbie question: I have TensorFlow code that creates a PredictRequest proto. Is there a Python wrapper/utility to automatically convert it into InferInput objects to be used later in:

inputs.append(grpcclient.InferInput( ... ))
inputs[1].set_data_from_numpy(input1_data)
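
I'm not aware of a built-in converter, but here is a hedged sketch of doing it manually: PredictRequest.inputs is a map of name to TensorProto, which tf.make_ndarray can turn into NumPy arrays, and tritonclient.utils.np_to_triton_dtype can supply the matching Triton datatype string. The helper function name below is made up for illustration.

# Sketch: manual conversion from a TF Serving PredictRequest to Triton InferInput
# objects. Assumes every input tensor is a dense tensor representable in NumPy.
import tensorflow as tf
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def predict_request_to_infer_inputs(request):
    inputs = []
    for name, tensor_proto in request.inputs.items():
        array = tf.make_ndarray(tensor_proto)  # TensorProto -> NumPy array
        infer_input = grpcclient.InferInput(
            name, list(array.shape), np_to_triton_dtype(array.dtype)
        )
        infer_input.set_data_from_numpy(array)
        inputs.append(infer_input)
    return inputs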

genai-perf KeyError: 'service_kind'

environment

pip install "git+https://github.com/triton-inference-server/[email protected]#subdirectory=src/c++/perf_analyzer/genai-perf"
genai-perf==0.0.3

command

genai-perf \
  -m codex-py-mr \
  --service-kind triton \
  --backend tensorrtllm  \
  --tokenizer /mnt/models/source/tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --artifact-dir artifacts \
  --profile-export-file my_profile_export.json \
  --url 0.0.0.0:9001 \
  --input-file /data/script/llm_inputs.json \
  --extra-inputs "temperature":0.1 \
  --extra-inputs "top_p":1.0 \
  --extra-inputs "top_k":50 \
  --extra-inputs "random_seed":809451

Run log

2024-06-13 03:51 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m codex-py-mr --async --input-data artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/llm_inputs.json --service-kind triton -u 0.0.0.0:9001 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/my_profile_export.json -i grpc --streaming --shape max_tokens:1 --shape text_input:1 --concurrency-range 1'
WARNING: Pass contained only one request, so sample latency standard deviation will be infinity (UINT64_MAX).
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 144, in run
    data_parser = calculate_metrics(args, tokenizer)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 87, in calculate_metrics
    return LLMProfileDataParser(
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 533, in __init__
    super().__init__(filename)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 445, in __init__
    self._get_profile_metadata(data)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 449, in _get_profile_metadata
    self._service_kind = data["service_kind"]
KeyError: 'service_kind'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 154, in main
    run()
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 147, in run
    raise GenAIPerfException(e)
genai_perf.exceptions.GenAIPerfException: 'service_kind'
2024-06-13 03:51 [ERROR] genai_perf.main:158 - 'service_kind'

Converting InferenceRequest to InferInput

I have a decoupled model (python backend) that receives requests from a client and sends the request to another downstream model, and the intermediate model only processes some inputs and passes the rest to the next model.

Currently, I'm converting the inputs to numpy arrays first and then wrapping them in InferInput.

for input_name in self.input_names[1:]:
    data_ = pb_utils.get_input_tensor_by_name(request, input_name)\
        .as_numpy()\
        .reshape(-1)
    input_ = triton_grpc.InferInput(
        input_name, data_.shape, "FP32" if data_.dtype == np.float32 else "INT32"
    )
    input_.set_data_from_numpy(data_)
    inputs.append(input_)

However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.

Is there a way to convert InferenceRequest to InferInput more efficiently?

Thanks!

Unable to use triton client with shared memory in C++ (Jetpack 6 device)

I am using the tritonserver + client iGPU release (tritonserver2.41.0-igpu.tar.gz) on a Jetpack 6 device. I want to use the shared memory functions of the Triton client, which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is not present in the Triton client's include directory, which leads to a compilation error.

After making the following changes to src/c++/library/CMakeLists.txt and building the client from source, I was able to include the header and use the shared memory functions (Triton client branch: r23.12).

@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
   # libgrpcclient object build
   set(
       REQUEST_SRCS
-      grpc_client.cc common.cc
+      grpc_client.cc common.cc shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      grpc_client.h common.h ipc.h
+      grpc_client.h common.h ipc.h shm_utils.h
   )
 
   add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
   # libhttpclient object build
   set(
       REQUEST_SRCS
-      http_client.cc common.cc cencode.c
+      http_client.cc common.cc cencode.c shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      http_client.h common.h ipc.h cencode.h
+      http_client.h common.h ipc.h cencode.h shm_utils.h
   )
 
   add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
       FILES
       ${CMAKE_CURRENT_SOURCE_DIR}/common.h
       ${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+      ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
       DESTINATION include
   )
 

Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?

Delayed aio infer during burst requests

Background

I have a simple FastAPI app that calls a Triton server using the Triton client. It works great for small, non-rapid requests, but the behavior starts to degrade when I introduce a lot of concurrent requests, for example with JMeter. Originally I thought the bottleneck was on my Triton server, but it turns out it happens on the client (the FastAPI app). Both the Triton server and the client run on my local PC.

Reproduce attempt

Below are all components required to reproduce this problem.

# Repo structure
.
├── model_repository
│   └── dummy
│       ├── 1
│       │   └── model.py
│       └── config.pbtxt
<snip>
# config.pbtxt

backend: "python"
max_batch_size: 10

dynamic_batching {}

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [3, 1000, 1000]
  }
]

output [
  {
    name: "result"
    data_type: TYPE_INT32
    dims: [5]
  }
]
# model.py
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        n = len(requests)
        print(f"received {n} requests, at {time.time()}", flush=True)
        # always return dummy tensor
        responses = []
        for i in range(n):
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("result", np.arange(5, dtype=np.int32))]
                )
            )
        return responses
# http_client_aio.py
import asyncio

import numpy as np
import tritonclient.http.aio as httpclient


async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)

    tasks = []
    for i in range(500):
        fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        fake_input.set_data_from_numpy(fake_image)
        task = client.infer("dummy", [fake_input])
        tasks.append(task)
    print("done sending")

    await asyncio.gather(*tasks)
    await client.close()


if __name__ == "__main__":
    asyncio.run(main())

Docker commands for Triton server.

docker run --shm-size=1gb --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models

I opened 2 terminals, ran the Docker command in the first and http_client_aio.py in the second. Here is what happened:

  1. Started http_client_aio.py.
  2. "done sending" was printed after ~2 secs.
  3. There were ~3 secs of delay during which nothing seemed to happen.
  4. Logs in the Triton server (the TritonPythonModel print statements) started rolling in.

Questions

  1. Am I missing something?
  2. What happened during those ~3 secs of delay?
  3. Can I saturate the Triton server using just the aio client with a relatively large inference payload (I use 1000x1000 image resolution)?
  4. What are the best practices to handle this bottleneck? (See the sketch below.)
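
Regarding the last two questions, a sketch under an assumption (not a confirmed diagnosis): each 1x3x1000x1000 FP32 payload is roughly 12 MB, and building all 500 InferInput objects up front serializes that data on the event-loop thread before anything is sent. Capping the number of in-flight requests and building each payload inside its own task may smooth out the burst.

# Sketch: cap in-flight requests with a semaphore and build each payload inside
# its own task, so serialization work is spread out instead of happening all at
# once before gather(). The limit of 20 is illustrative.
import asyncio

import numpy as np
import tritonclient.http.aio as httpclient


async def infer_one(client, sem, fake_image):
    async with sem:  # at most N requests in flight at a time
        fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        fake_input.set_data_from_numpy(fake_image)
        return await client.infer("dummy", [fake_input])


async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    sem = asyncio.Semaphore(20)
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)

    await asyncio.gather(*(infer_one(client, sem, fake_image) for _ in range(500)))
    await client.close()


if __name__ == "__main__":
    asyncio.run(main())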

How to ensure `load_model` applies to the same server pod as `infer`?

In a k8s environment, there are multiple server replicas. We use the Python client, and on the server side we use explicit model-control mode.

Now we do something like

server_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)

if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)

triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)

Since server_url is the k8s service endpoint, I guess it relies on the k8s load balancer to pick a random pod. In this case, how do we ensure that is_model_ready, load_model and infer all apply to the same pod?
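
One pattern that may help (a sketch only; whether it fits your deployment is an assumption) is to bypass the load-balanced service for model control: resolve the individual pod IPs through a headless service and use a dedicated client per pod, so that the readiness check, load, and inference all hit the same replica. The headless service name and model name below are hypothetical.

# Sketch: resolve pod IPs from a (hypothetical) headless service and create one
# client per pod, so is_model_ready/load_model/infer all target the same replica.
import socket

from tritonclient.http import InferenceServerClient

HEADLESS = "triton-headless.{namespace}.svc.cluster.local"  # hypothetical headless service
model_name = "my_model"  # placeholder

_, _, pod_ips = socket.gethostbyname_ex(HEADLESS)

for ip in pod_ips:
    client = InferenceServerClient(url=f"{ip}:8000")
    if not client.is_model_ready(model_name):
        client.load_model(model_name)
    # ...infer against the same `client` (same pod) as needed.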
