triton-inference-server / client
Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
License: BSD 3-Clause "New" or "Revised" License
Hi,
Do I need to create a JSON file containing every input defined across all of my config.pbtxt files?
If I have two models and each model has two inputs, should input_data.json contain sample data for all four inputs, or something else?
Thanks
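For reference, a minimal sketch of an input-data document for a single model with two inputs. It assumes perf_analyzer profiles one model per run (the one selected with -m), so each file only needs entries for that model's inputs. The input names INPUT0 and INPUT1, their shapes, and the helper name are hypothetical placeholders.

```python
import json


def build_input_data(inputs):
    """Build a perf_analyzer-style input-data document for ONE model.

    `inputs` maps each input name to a (flat list of values, shape) pair.
    """
    entry = {
        name: {"content": values, "shape": shape}
        for name, (values, shape) in inputs.items()
    }
    # "data" is a list so that more request payloads can be appended later.
    return {"data": [entry]}


# Hypothetical model with two inputs of shape [4].
doc = build_input_data({
    "INPUT0": ([1, 2, 3, 4], [4]),
    "INPUT1": ([5, 6, 7, 8], [4]),
})
print(json.dumps(doc, indent=2))
```

Under that assumption, a second model would get its own file built the same way and passed in a separate perf_analyzer run.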
I added grpcclient_static to target_link_libraries in my CMakeLists.txt, but I get gRPC link errors when compiling. Please tell me what is wrong. Thank you!
Hi. We are trying to integrate with OIP model servers over at the Feast feature store and need to add mlserver and tritonclient as optional dependencies. The problem is that we also already depend on snowflake-connector-python, which still has a strict urllib3<2.0.0 requirement for Python 3.9. I saw that the urllib3 version pin here was only added to avoid vulnerability reports (#457). I'm not sure which vulnerability that referred to, but it seems urllib3 still plans to ship security fixes for v1. Is it possible to revert the version pin, or to allow something like (>=1.26.18,<2 or >=2.0.7)? Thanks
Description
test_cuda_shared_memory.py fails when the batch dimension is smaller than 2. I think the issue comes from PyTorch, but I'm wondering whether there is a workaround that makes it work with PyTorch versions newer than 1.12.
Workaround
Use torch 1.12. I tested it and it works fine.
Triton Information
What version of Triton client are you using?
compiled from latest version b0b5b27
To Reproduce
import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1, 2, 4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()
        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))
        cudashm.destroy_shared_memory_region(shm_handle)
I have deployed Triton on Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are correctly measured, but when I try to collect metrics using --collect-metrics option, the following message appears:
WARNING: Unable to parse ‘nv_gpu_utilization’ metric.
WARNING: Unable to parse ‘nv_gpu_power_usage’ metric.
WARNING: Unable to parse ‘nv_gpu_memory_used_bytes’ metric.
WARNING: Unable to parse ‘nv_gpu_memory_total_bytes’ metric.
The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1
Is it possible to solve this problem?
Thanks.
The CMake configure step fails when building cc-clients:
❯ cmake --version
cmake version 3.21.0
CMake suite maintained and supported by Kitware (kitware.com/cmake).
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients
...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
include could not find requested file:
/home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)
-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
include could not find requested file:
/home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
library/CMakeLists.txt:49 (find_package)
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
library/CMakeLists.txt:49 (find_package)
-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2
genai-perf \
-m codex-py-mr \
--service-kind triton \
--backend tensorrtllm \
--tokenizer /mnt/models/source/tokenizer \
--concurrency 1 \
--measurement-interval 4000 \
--artifact-dir artifacts2 \
--profile-export-file my_profile_export.json \
--url 0.0.0.0:9001 \
--input-file /data/script/llm_inputs.json \
--generate-plots
The /data/script/llm_inputs.json file contains only one string input.
I found that when I specify --input-file, genai-perf wraps the contents of the llm_inputs.json file in the following format:
{
  "data": [
    {
      "text_input": [
        "${question}"
      ],
      "max_tokens": [
        256
      ],
      "model": "codex-py-mr"
    }
  ]
}
Why can --input-file define only a single prompt? Can I define multiple prompts?
Can the model input key that genai-perf wraps only be text_input? Can I customize this key?
Can --input-dataset only use a HuggingFace dataset? Can I pass in a custom dataset?
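The wrapped layout shown above is a list under "data", so it can in principle hold more than one entry. A minimal sketch that builds several prompts in that same layout, assuming the result is handed to perf_analyzer directly through --input-data. The prompts and the helper name are hypothetical, and whether genai-perf's own --input-file accepts more than one prompt is not confirmed here.

```python
import json


def wrap_prompts(prompts, model, max_tokens=256):
    """Return a document with one "data" entry per prompt."""
    return {
        "data": [
            {"text_input": [prompt], "max_tokens": [max_tokens], "model": model}
            for prompt in prompts
        ]
    }


# Hypothetical prompts; the model name is the one used in the command above.
doc = wrap_prompts(
    ["Write a function that reverses a string.", "Explain what a mutex is."],
    model="codex-py-mr",
)
print(json.dumps(doc, indent=2))
```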
When genai-perf is installed using pip from GitHub (as documented), on first run it tries to download several files from Hugging Face, like this:
$ docker run --rm -it --name test -u 0 gpu-tritonserver-tst:latest bash -c "genai-perf --help"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 700/700 [00:00<00:00, 5.45MB/s]
tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 78.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.25MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 2.72MB/s]
usage: genai-perf [-h] [--expected-output-tokens EXPECTED_OUTPUT_TOKENS] [--input-type {url,file,synthetic}] [--input-tokens-mean INPUT_TOKENS_MEAN] [--input-tokens-stddev INPUT_TOKENS_STDDEV] -m MODEL
[--num-of-output-prompts NUM_OF_OUTPUT_PROMPTS] [--output-format {openai_chat_completions,openai_completions,trtllm,vllm}] [--random-seed RANDOM_SEED] [--concurrency CONCURRENCY]
[--input-data INPUT_DATA] [-p MEASUREMENT_INTERVAL] [--profile-export-file PROFILE_EXPORT_FILE] [--request-rate REQUEST_RATE] [--service-kind {triton,openai}] [-s STABILITY_PERCENTAGE]
[--streaming] [-v] [--version] [--endpoint ENDPOINT] [-u URL] [--dataset {openorca,cnn_dailymail}]
CLI to profile LLMs and Generative AI models with Perf Analyzer
[..]
This "calling-home" behavior will prevent genai-perf from running correctly in corporate air-gapped environments. All required dependencies need to be collected by Python or Bash scripts at install time (which can occur on a different server, such as an internet-connected build server where most pull actions are permitted) rather than being pulled on the first run of the program.
I tested tritonclient:2.43.0 on Ubuntu:22.04 with grpcio:1.62.1 and ran into a memory leak. Example for reproduction:
import asyncio
from tritonclient.grpc.aio import InferenceServerClient

async def get_triton_client():
    return InferenceServerClient(url='127.0.0.1:8002', verbose=True)

if __name__ == "__main__":
    while True:
        print(asyncio.run(get_triton_client()))
The problem also reproduces on the latest versions, tritonclient:2.47.0 and grpcio:1.64.1.
I found that I can avoid it by using grpcio:1.58.0.
As I understand it, tritonclient has a warning to avoid using some versions of grpcio:
Could you confirm it?
I have a Triton server running in Docker, where I loaded the CLIP model. I wrote some simple code to run inference and get the output of this model, but I get the error AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'
Here is my client code:
from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient

if __name__ == "__main__":
    triton_client = httpclient.InferenceServerClient(url="localhost:8003")

    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')
    inputs = processor(images=image, return_tensors="pt")['pixel_values']

    inputs = []
    outputs = []

    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)

    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))

    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )

    print(results.as_numpy("output__0"))
Here's the error I'm getting
Traceback (most recent call last):
  File "main.py", line 18, in <module>
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'
Help me please
It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?
I use tc::InferenceServerGrpcClient.Infer but get an error saying that models with decoupled transaction policy are not supported.
When I use tc::InferenceServerGrpcClient.AsyncInfer, the client reports no error, but no request shows up on the server, and the server log has this error: Infer failed: ModelInfer RPC doesn't support models with decoupled transaction policy
Which API should I use?
Description
I need to use tritonclient within Seldon MLServer. But when I import the library into the server, I get this error:
TypeError: Couldn't build proto file into descriptor pool: duplicate symbol 'inference.ServerLiveRequest'
My best guess is that MLServer and tritonclient both import the same proto file, which follows the KServe API v2 spec, and that this causes a symbol name collision.
To Reproduce
Expected behavior
I should be able to import tritonclient in MLServer. Perhaps a namespace that prevents symbol collisions among KServe API clients would do the trick. I am not a gRPC expert, but I expect some mechanism of that sort exists.
Neither find_package() nor FetchContent works out of the box for a standalone C++ CMake app.
Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder. Alternatively install tritonclient globally. In any case, the following minimal cmake fails:
# tritonclient
find_package(TritonCommon REQUIRED)
find_package(TritonClient REQUIRED)
...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:43 (add_executable):
Target "test" links to target "protobuf::libprotobuf" but the target was
not found. Perhaps a find_package() call is missing for an IMPORTED
target, or an ALIAS target is missing?
# tritonclient
FetchContent_Declare(
tritonclient
GIT_REPOSITORY https://github.com/triton-inference-server/client
GIT_TAG r24.04
)
set(TRITON_ENABLE_CC_GRPC ON)
set(TRITON_COMMON_REPO_TAG r24.04)
set(TRITON_THIRD_PARTY_REPO_TAG r24.04)
set(TRITON_CORE_REPO_TAG r24.04)
FetchContent_MakeAvailable(tritonclient)
...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:55 (add_executable):
Target "test" links to target "TritonClient::grpcclient" but the target
was not found. Perhaps a find_package() call is missing for an IMPORTED
target, or an ALIAS target is missing?
What is the recommended way to create a standalone app with tritonclient dependency currently? It seems that for find_package the link targets are broken and for FetchContent there is no support at all because no targets are exported.
I ran the src/python/examples/simple_grpc_shm_client.py file, but got the error below:
Traceback (most recent call last):
  File "demo.py", line 95, in <module>
    triton_client.register_system_shared_memory(
  File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_client.py", line 1223, in register_system_shared_memory
    raise_error_grpc(rpc_error)
  File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unable to open shared memory region: '/output_simple'
How can I fix this?
Based on my reading of the code, urllib3 is only used for the http backend, but the top-level requirements.txt lists urllib3 as a dependency. It should be listed only in requirements_http.txt and not in the top-level requirements.txt.
This creates unnecessary conflicts when one only wants to depend on the grpc backend and not the http backend, such as the dependency conflict reported in #648.
Hello, a memory leak was detected when executing this code. The code was run on Python 3.10, triton-client 2.41.1, and torch 2.1.2.
import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2  # FP32 elements are 4 bytes each
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)
The leak occurs when the dlpack function is called in torch.from_dlpack(smt)
ModelInferResponse is defined as follows:
type ModelInferResponse struct {
    // other ...
    Outputs []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
    RawOutputContents [][]byte `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}
After calling the ModelInfer interface and getting the response, how do I parse the contents of the RawOutputContents field according to the Datatype?
Also, why is Outputs[i].Contents empty?
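A minimal sketch of decoding one raw_output_contents entry, written in Python for brevity (the same logic should port to Go with encoding/binary). It assumes the KServe v2 convention that raw tensor bytes are a flat, row-major, little-endian array, and that each BYTES element is prefixed with a 4-byte little-endian length. The helper name and the datatype table are illustrative and cover only some datatypes. Under the same convention, a response that carries raw_output_contents leaves each output's contents field unset, which would explain why Outputs[i].Contents is empty.

```python
import struct

# struct format characters for a subset of KServe v2 datatypes (illustrative).
_FORMATS = {
    "BOOL": "?", "INT8": "b", "INT16": "h", "INT32": "i", "INT64": "q",
    "UINT8": "B", "UINT16": "H", "UINT32": "I", "UINT64": "Q",
    "FP32": "f", "FP64": "d",
}


def decode_raw(raw: bytes, datatype: str) -> list:
    """Decode one raw output buffer into a flat Python list."""
    if datatype == "BYTES":
        # Each element: 4-byte little-endian length, then that many bytes.
        items, offset = [], 0
        while offset < len(raw):
            (length,) = struct.unpack_from("<I", raw, offset)
            offset += 4
            items.append(raw[offset:offset + length])
            offset += length
        return items
    fmt = _FORMATS[datatype]
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", raw))


# Usage: three FP32 values packed the way the sketch assumes the server does.
raw = struct.pack("<3f", 1.0, 2.0, 3.0)
print(decode_raw(raw, "FP32"))  # [1.0, 2.0, 3.0]
```

The flat list would then be reshaped using the shape reported in the matching Outputs[i] entry, with raw_output_contents[i] corresponding to Outputs[i].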
Newbie question: I have TensorFlow code that creates a PredictRequest proto object. Is there a Python wrapper or utility that automatically converts it to InferInput objects, to be used later in:
inputs.append(grpcclient.InferInput( ... ))
inputs[1].set_data_from_numpy(input1_data)
pip install "git+https://github.com/triton-inference-server/[email protected]#subdirectory=src/c++/perf_analyzer/genai-perf"
genai-perf==0.0.3
genai-perf \
-m codex-py-mr \
--service-kind triton \
--backend tensorrtllm \
--tokenizer /mnt/models/source/tokenizer \
--concurrency 1 \
--measurement-interval 4000 \
--artifact-dir artifacts \
--profile-export-file my_profile_export.json \
--url 0.0.0.0:9001 \
--input-file /data/script/llm_inputs.json \
--extra-inputs "temperature":0.1 \
--extra-inputs "top_p":1.0 \
--extra-inputs "top_k":50 \
--extra-inputs "random_seed":809451
2024-06-13 03:51 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m codex-py-mr --async --input-data artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/llm_inputs.json --service-kind triton -u 0.0.0.0:9001 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/my_profile_export.json -i grpc --streaming --shape max_tokens:1 --shape text_input:1 --concurrency-range 1'
WARNING: Pass contained only one request, so sample latency standard deviation will be infinity (UINT64_MAX).
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 144, in run
data_parser = calculate_metrics(args, tokenizer)
File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 87, in calculate_metrics
return LLMProfileDataParser(
File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 533, in __init__
super().__init__(filename)
File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 445, in __init__
self._get_profile_metadata(data)
File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 449, in _get_profile_metadata
self._service_kind = data["service_kind"]
KeyError: 'service_kind'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 154, in main
run()
File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 147, in run
raise GenAIPerfException(e)
genai_perf.exceptions.GenAIPerfException: 'service_kind'
2024-06-13 03:51 [ERROR] genai_perf.main:158 - 'service_kind'
I have a decoupled model (Python backend) that receives requests from a client and forwards them to another downstream model. The intermediate model only processes some of the inputs and passes the rest on to the next model.
Currently, I convert the inputs to numpy arrays first and then wrap them in InferInput.
for input_name in self.input_names[1:]:
    data_ = pb_utils.get_input_tensor_by_name(request, input_name)\
        .as_numpy()\
        .reshape(-1)
    input_ = triton_grpc.InferInput(input_name, data_.shape, "FP32" if data_.dtype == np.float32 else "INT32")
    input_.set_data_from_numpy(data_)
    inputs.append(input_)
However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.
Is there a way to convert InferenceRequest to InferInput more efficiently?
Thanks!
I am using the tritonserver + client igpu release (tritonserver2.41.0-igpu.tar.gz) on a JetPack 6 device. I want to use the shared memory functions of the Triton client, which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is not found in Triton Client's include directory, leading to a compilation error.
After making the following changes to src/c++/library/CMakeLists.txt and building the client from source, I was able to include the header and use the shared memory functions. (Triton Client Branch - r23.12)
@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
# libgrpcclient object build
set(
REQUEST_SRCS
- grpc_client.cc common.cc
+ grpc_client.cc common.cc shm_utils.cc
)
set(
REQUEST_HDRS
- grpc_client.h common.h ipc.h
+ grpc_client.h common.h ipc.h shm_utils.h
)
add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
# libhttpclient object build
set(
REQUEST_SRCS
- http_client.cc common.cc cencode.c
+ http_client.cc common.cc cencode.c shm_utils.cc
)
set(
REQUEST_HDRS
- http_client.h common.h ipc.h cencode.h
+ http_client.h common.h ipc.h cencode.h shm_utils.h
)
add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
FILES
${CMAKE_CURRENT_SOURCE_DIR}/common.h
${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+ ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
DESTINATION include
)
Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?
I have a simple FastAPI app that will call Triton server using Triton client. It works great for small, non-rapid requests. But the behavior starts to decline when I introduce a lot of concurrent requests, such as when using JMeter. Originally I thought the bottleneck was on my Triton server, but turns out it happened on the client (FastAPI app). Both Triton server and client are on my local PC.
Below are all components required to reproduce this problem.
# Repo structure
.
├── model_repository
│ └── dummy
│ ├── 1
│ │ └── model.py
│ └── config.pbtxt
<snip>
# config.pbtxt
backend: "python"
max_batch_size: 10
dynamic_batching {}
input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [3, 1000, 1000]
  }
]
output [
  {
    name: "result"
    data_type: TYPE_INT32
    dims: [5]
  }
]
# model.py
import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        n = len(requests)
        print(f"received {n} requests, at {time.time()}", flush=True)
        # always return dummy tensor
        responses = []
        for i in range(n):
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("result", np.arange(5, dtype=np.int32))]
                )
            )
        return responses
# http_client_aio.py
import asyncio
import numpy as np
import tritonclient.http.aio as httpclient

async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)
    tasks = []
    for i in range(500):
        fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        fake_input.set_data_from_numpy(fake_image)
        task = client.infer("dummy", [fake_input])
        tasks.append(task)
    print("done sending")
    await asyncio.gather(*tasks)
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())
Docker commands for Triton server.
docker run --shm-size=1gb --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models
I opened two terminals and ran the Docker command in the first and http_client_aio.py in the second. Here is what happened:
http_client_aio.py
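If the degradation comes from having hundreds of large requests in flight at once, one option is to cap concurrency on the client side. A minimal sketch using asyncio.Semaphore follows. Here fake_infer is a hypothetical stand-in for the real client.infer(...) call, and the limit of 8 is arbitrary. Whether this addresses the behavior reported here is an assumption, not something the report confirms.

```python
import asyncio


async def fake_infer(i):
    # Hypothetical stand-in for `await client.infer("dummy", [fake_input])`.
    await asyncio.sleep(0.001)
    return i


async def bounded_gather(n_requests, limit):
    """Run n_requests calls with at most `limit` of them in flight at once."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous calls observed

    async def one(i):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_infer(i)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(i) for i in range(n_requests)))
    return results, peak


results, peak = asyncio.run(bounded_gather(50, 8))
print(len(results), peak)
```

In a real client the InferInput would also be built inside the bounded section, so that only `limit` request payloads are held in memory at a time.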
In a k8s environment, there are multiple server replicas. We use the Python client, and on the server side we use explicit model control mode.
Currently we do something like this:
server_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)
if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)
triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)
Since server_url is the k8s service endpoint, I guess it relies on the k8s load balancer to choose a random pod. In this case, how do we ensure that is_model_ready, load_model and infer all apply to the same pod?