
triton-inference-server / client


Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.

License: BSD 3-Clause "New" or "Revised" License

CMake 4.53% C++ 43.14% C 0.52% Shell 0.60% Go 0.62% Java 6.62% Scala 0.57% Python 43.04% JavaScript 0.36%

client's Introduction

Triton Inference Server

📣 vLLM x Triton Meetup at Fort Mason on Sept 9th, 4:00 - 9:00 pm

We are excited to announce that we will be hosting our Triton user meetup with the vLLM team at Fort Mason on Sept 9th, 4:00 - 9:00 pm. Join us for this exclusive event where you will learn about the newest vLLM and Triton features, get a glimpse into the roadmaps, and connect with fellow users and the NVIDIA Triton and vLLM teams. Seating is limited and registration confirmation is required to attend - please register here to join the meetup.



[!WARNING]

LATEST RELEASE

You are currently on the main branch which tracks under-development progress towards the next release. The current release is version 2.48.0 and corresponds to the 24.07 container release on NVIDIA GPU Cloud (NGC).

Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real-time, batched, ensemble and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

Major features include support for multiple deep learning and machine learning frameworks, concurrent model execution, dynamic batching, sequence batching and implicit state management for stateful models, a backend API for custom backends and pre/post-processing, model pipelines via ensembling or Business Logic Scripting (BLS), HTTP/REST and gRPC inference protocols based on the community-developed KServe protocol, C and Java APIs for in-process use, and metrics covering GPU utilization, server throughput, and latency.

New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!

Join the Triton and TensorRT community and stay current on the latest product updates, bug fixes, content, best practices, and more. Need enterprise support? NVIDIA global support is available for Triton Inference Server with the NVIDIA AI Enterprise software suite.

Serve a Model in 3 Easy Steps

# Step 1: Create the example model repository
git clone -b r24.07 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 tritonserver --model-repository=/models

# Step 3: Send an Inference Request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.07-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT

Please read the QuickStart guide for additional information regarding this example. The quickstart guide also contains an example of how to launch Triton on CPU-only systems. New to Triton and wondering where to get started? Watch the Getting Started video.

Examples and Tutorials

Check out NVIDIA LaunchPad for free access to a set of hands-on labs with Triton Inference Server hosted on NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM, are located on the NVIDIA Deep Learning Examples page on GitHub. The NVIDIA Developer Zone contains additional documentation, presentations, and examples.

Documentation

Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker images.

Using Triton

Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or more models into a model repository. Depending on the type of the model and on what Triton capabilities you want to enable for the model, you may need to create a model configuration for the model.
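Roughly, and as a sketch only (the model name, backend, and tensor names below are hypothetical, and the details for your model may differ), a repository for a single ONNX model might look like:

model_repository/
└── my_model
    ├── config.pbtxt
    └── 1
        └── model.onnx

# config.pbtxt (minimal sketch; adjust names, shapes, and backend to your model)
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

Triton is then pointed at the repository with --model-repository, as in the quick-start commands above.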

Configure and Use Triton Inference Server

Client Support and Examples

A Triton client application sends inference and other requests to Triton. The Python and C++ client libraries provide APIs to simplify this communication.
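As a minimal sketch of what that looks like with the Python HTTP client (the model name "my_model" and the tensor names "INPUT0"/"OUTPUT0" are hypothetical placeholders, not a model shipped with Triton):

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

# Connect to a Triton server listening on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request tensor from a numpy array.
data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), np_to_triton_dtype(data.dtype))
infer_input.set_data_from_numpy(data)

# Send the inference request and read the result back as numpy.
result = client.infer(model_name="my_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT0"))

The gRPC client in tritonclient.grpc exposes the same InferInput/infer pattern.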

Extend Triton

Triton Inference Server's architecture is specifically designed for modularity and flexibility.

Additional Documentation

Contributing

Contributions to Triton Inference Server are more than welcome. To contribute please review the contribution guidelines. If you have a backend, client, example or similar contribution that is not modifying the core of Triton, then you should file a PR in the contrib repo.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project. When posting issues in GitHub, follow the process outlined in the Stack Overflow document. Ensure posted examples are:

  • minimal – use as little code as possible that still produces the same problem
  • complete – provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing problems, the more time we have to fix them
  • verifiable – test the code you're about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community GitHub Discussions.

For more information

Please refer to the NVIDIA Developer Triton page for more information.

client's People

Contributors

aleksa2808, andydai-nv, bencsikandrei, coderham, debermudez, dyastremsky, dzier, fpetrini15, guanluo, indrajit96, izzyputterman, jbkyang-nvi, joachimhgg, kpedro88, krishung5, kthui, lkomali, matthewkotila, mc-nv, nealvaidya, nnshah1, nv-braf, nv-hwoo, nv-kmcgill53, pvijayakrish, rmccorm4, tabrizian, tanmayv25, tgerdesnv, yinggeh


client's Issues

input_data

Hi,

Do I need to create a JSON file with sample data for every input defined across all of my config.pbtxt files?

For example, if I have two models with two inputs each, should input_data.json contain sample input data for all four inputs, or only for the model being profiled?

Thanks
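Assuming this refers to perf_analyzer's --input-data option: each run profiles a single model, so the JSON only needs entries for that model's inputs, with one file (or one run) per model. A rough sketch with hypothetical input names:

{
  "data": [
    {
      "INPUT0": [1, 2, 3, 4],
      "INPUT1": [0.5, 0.25]
    }
  ]
}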

Make perf_analyzer work on macbook

It looks like perf_analyzer doesn't work on a MacBook, and pip install tritonclient doesn't include perf_analyzer. Is this true? Is it possible to support it?

For the golang gRPC client, when calling the ModelInfer interface, how do I parse useful values from ModelInferResponse?

ModelInferResponse is defined as follows

type ModelInferResponse struct {
        // other ...

        Outputs []*ModelInferResponse_InferOutputTensor `protobuf:"bytes,5,rep,name=outputs,proto3" json:"outputs,omitempty"`
        RawOutputContents [][]byte `protobuf:"bytes,6,rep,name=raw_output_contents,json=rawOutputContents,proto3" json:"raw_output_contents,omitempty"`
}

After calling the ModelInfer interface and getting the response, how do I parse the contents of the RawOutputContents field according to the Datatype?
Also, why is Outputs[i].Contents empty?
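For what it's worth, the bytes in RawOutputContents are the tensor elements laid out contiguously in row-major order; the datatype and shape reported in Outputs[i] tell you how to decode them, and Outputs[i].Contents is typically empty because the server returns data through raw_output_contents instead. A sketch of the decoding idea, written in Python with numpy for brevity (the same byte layout applies in Go, which has to do this decoding itself; BYTES/string tensors use a length-prefixed encoding not covered here):

import numpy as np

# Partial mapping of Triton datatypes to numpy dtypes (little-endian).
TRITON_TO_NUMPY = {
    "FP32": np.float32,
    "FP16": np.float16,
    "INT64": np.int64,
    "INT32": np.int32,
    "UINT8": np.uint8,
}

def decode_raw_output(raw_bytes, datatype, shape):
    # Reinterpret the raw bytes as the advertised datatype and shape.
    return np.frombuffer(raw_bytes, dtype=TRITON_TO_NUMPY[datatype]).reshape(shape)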

Memory leak from grpcio

I tested tritonclient:2.43.0 on Ubuntu:22.04 with grpcio:1.62.1 and was confronted with a memory leak. Example for reproduction:

import asyncio
from tritonclient.grpc.aio import InferenceServerClient

async def get_triton_client():
    return InferenceServerClient(url='127.0.0.1:8002', verbose=True)

if __name__ == "__main__":
    while(True):
        print(asyncio.run(get_triton_client()))

The problem also reproduces on the latest versions, tritonclient 2.47.0 and grpcio 1.64.1.
I found that pinning grpcio to 1.58.0 avoids it.

As I understand it, tritonclient has a warning to avoid using some versions of grpcio:

# Check grpc version and issue warnings if grpc version is known to have

Could you confirm it?
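One pattern that at least sidesteps constructing a new channel per call (an assumption about what triggers the leak, not a confirmed fix) is to create a single client and close it explicitly when done:

import asyncio
from tritonclient.grpc.aio import InferenceServerClient

async def main():
    # Reuse one client (and its underlying gRPC channel) for all requests,
    # then release it explicitly instead of relying on garbage collection.
    client = InferenceServerClient(url="127.0.0.1:8002")
    try:
        print(await client.is_server_live())
    finally:
        await client.close()

if __name__ == "__main__":
    asyncio.run(main())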

Can not import GRPC Tritonclient in Seldon MLServer

Description
I need to use tritonclient within a Seldon MLServer, but when I import the library into the server, I get this error:

TypeError: Couldn't build proto file into descriptor pool: duplicate symbol 'inference.ServerLiveRequest'

My best guess is that both MLServer and tritonclient import the same proto file following the KServe API v2 spec, which causes a symbol name collision.

To Reproduce

  • import tritonclient.grpc
  • start MLServer

Expected behavior
Be able to import tritonclient in MLServer. Perhaps a namespace to prevent symbol collisions among KServe API clients would do the trick; I am not a gRPC expert, but I expect some mechanism of that sort.

DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.

Description
test_cuda_shared_memory.py fails when the batch-size dimension is smaller than 2. I think this issue comes from PyTorch, but I'm just wondering if there's any workaround to make it work with PyTorch versions newer than 1.12.

Workaround
Downgrade to torch 1.12; I tested it and it works fine.

Triton Information
What version of Triton client are you using?
Compiled from the latest version, b0b5b27.

To Reproduce

import unittest

import torch
import tritonclient.utils.cuda_shared_memory as cudashm


class DLPackTest(unittest.TestCase):
    """
    Testing DLPack implementation in CUDA shared memory utilities
    """

    def test_from_gpu(self):
        # Create GPU tensor via PyTorch and CUDA shared memory region with
        # enough space
        tensor_shape = (1,2,4)
        gpu_tensor = torch.ones(tensor_shape).cuda(0)
        byte_size = gpu_tensor.nelement() * gpu_tensor.element_size()

        shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)

        # Set data from DLPack specification of PyTorch tensor
        cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])

        # Make sure the DLPack specification of the shared memory region can
        # be consumed by PyTorch
        smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32",  tensor_shape)
        generated_torch_tensor = torch.from_dlpack(smt)
        self.assertTrue(torch.allclose(gpu_tensor, generated_torch_tensor))

        cudashm.destroy_shared_memory_region(shm_handle)

How do I get genai-perf to analyze my defined data set

My analysis command

genai-perf \
  -m codex-py-mr \
  --service-kind triton \
  --backend tensorrtllm  \
  --tokenizer /mnt/models/source/tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --artifact-dir artifacts2 \
  --profile-export-file my_profile_export.json \
  --url 0.0.0.0:9001 \
  --input-file /data/script/llm_inputs.json \
  --generate-plots

The /data/script/llm_inputs.json file contains only one string input.
I found that when I specify --input-file, genai-perf wraps the contents of the file into the following format:

{
  "data": [
    {
      "text_input": [
        "${question}"
      ],
      "max_tokens": [
        256
      ],
      "model": "codex-py-mr"
    }
  ]
}

My question

  1. Why can --input-file only define a single prompt? Can I define multiple prompts?

  2. Can the wrapped model input key of genai-perf only be text_input? Can I customize the value of this key?

  3. Can --input-dataset only use HuggingFace datasets? Can I pass in a custom dataset?

[QUESTION] - Tensorflow python infer to triton client

Newbie question: I have TensorFlow code that creates a PredictRequest protobuf. Is there a Python wrapper/utility to automatically convert it to InferInput objects, to be used later in:

inputs.append(grpcclient.InferInput( ... ))
inputs[1].set_data_from_numpy(input1_data)
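I'm not aware of a ready-made wrapper, so as a hedged sketch (assuming the PredictRequest holds TensorProto inputs keyed by name, and that those names match the Triton model configuration):

import tensorflow as tf
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def predict_request_to_infer_inputs(predict_request):
    # Convert each TensorProto in the PredictRequest into an InferInput.
    inputs = []
    for name, tensor_proto in predict_request.inputs.items():
        array = tf.make_ndarray(tensor_proto)  # TensorProto -> numpy
        infer_input = grpcclient.InferInput(
            name, list(array.shape), np_to_triton_dtype(array.dtype)
        )
        infer_input.set_data_from_numpy(array)
        inputs.append(infer_input)
    return inputs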

Decreased Accuracy in Text Detection and Recognition Models after Upgrading to tritonclient 23.04-py3

Issue Summary:

After upgrading from tritonclient version 21.02-py3 to 23.04-py3, I noticed a decrease in accuracy for my text detection and recognition models, which are based on PaddleOCR. Prior to the upgrade, these models were delivering accurate results. The models are being used in their ONNX format for inference, not using the optimized plan version.

Steps to Reproduce:

  1. Upgrade tritonclient to version 23.04-py3.
  2. Use the text detection and recognition models based on PaddleOCR in their ONNX format.
  3. Compare the results of text detection and recognition with those obtained using tritonclient version 21.02-py3.
Expected Behavior:

The text detection and recognition models should maintain or improve their accuracy after upgrading tritonclient.

Actual Behavior:

The accuracy of the text detection and recognition models has decreased significantly after the upgrade.

Additional Information:

  • The models are not in the optimized plan format; they are used directly in ONNX format.
  • Are there any specific configuration changes or updates in tritonclient version 23.04-py3 that could affect the accuracy of models based on PaddleOCR?
  • Are there any known compatibility issues or changes in behavior related to ONNX model inference in the newer tritonclient version?

Environment:

Tritonclient version: 23.04-py3
GPU : Tesla V100-PCIE-16GB

AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

I have a Triton Server that runs in Docker, where I initialized the CLIP model. I wrote some simple code to try to run inference and get this model's output, but I get the error: AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Here is my client code:

from transformers import CLIPProcessor
from PIL import Image
import tritonclient.http as httpclient

if __name__ == "__main__":

    triton_client = httpclient.InferenceServerClient(url="localhost:8003")

    # Example of tracing an image processing:
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("9997103.jpg").convert('RGB')

    inputs = processor(images=image, return_tensors="pt")['pixel_values']

    inputs = []
    outputs = []

    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
    inputs[0].set_data_from_numpy(image)

    outputs.append(triton_client.InferRequestedOutput("output__0", binary_data=False))

    results = triton_client.infer(
        model_name='clip',
        inputs=inputs,
        outputs=outputs,
    )

    print(results.as_numpy("output__0"))

Here's the error I'm getting

Traceback (most recent call last):
  File "main.py", line 18, in <module>
    inputs.append(triton_client.InferInput("input__0", image.shape, "TYPE_FP32"))
AttributeError: 'InferenceServerClient' object has no attribute 'InferInput'

Help me please
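For reference, InferInput and InferRequestedOutput live on the tritonclient.http module rather than on the client object, and the HTTP client expects the short datatype string ("FP32", not "TYPE_FP32"). A hedged rework of the snippet above (model and tensor names kept as in the question):

import numpy as np
import tritonclient.http as httpclient
from transformers import CLIPProcessor
from PIL import Image

triton_client = httpclient.InferenceServerClient(url="localhost:8003")

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("9997103.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="np")["pixel_values"]

# InferInput comes from the module, and the datatype string is "FP32".
infer_input = httpclient.InferInput("input__0", list(pixel_values.shape), "FP32")
infer_input.set_data_from_numpy(pixel_values.astype(np.float32))

requested_output = httpclient.InferRequestedOutput("output__0", binary_data=False)

results = triton_client.infer(
    model_name="clip",
    inputs=[infer_input],
    outputs=[requested_output],
)
print(results.as_numpy("output__0"))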

Revert urllib3 version pin

Hi. We are trying to integrate with OIP model servers over at Feast feature store and need to add mlserver and tritonclient as optional dependencies. The problem is that we also already depend on snowflake-connector-python, which still has a strict urllib3<2.0.0 requirement for Python 3.9. I saw that the urllib3 version pin here was only added to avoid vulnerability reports. #457 Not sure which vulnerability that was referring to, but it seems urllib3 plans to keep shipping security fixes for v1. Is it possible to revert the version pin? Or maybe allow something like (>=1.26.18,<2 or >=2.0.7). Thanks

Allow client to request subset of ensemble model outputs

Problem:

A common use case of an ensemble model is preprocess -> inference -> postprocess. In most cases, the user will request the last step's output (the postprocessed inference), as that can, for example, reduce networking between Triton and the application. But often the application also requires the raw inference result (an intermediate output). The current configuration and client interface (at least the Python one) seem to support requesting a subset of, or all, the outputs (the Python client takes a list of InferRequestedOutput). But for some reason, requesting a subset of all outputs results in InferenceServerException: [StatusCode.INVALID_ARGUMENT] in ensemble, [request id: ] unexpected deadlock, at least one output is not set while no more ensemble steps can be made.

Solution:

As of today, the only solution I found is to create 2 ensemble models (or actually N - 1 ensembles, where N is the total number of steps):

  • preprocess -> inference
  • preprocess -> inference -> postprocess
    Then my client requests output from either the first or the second ensemble model. This is not ideal, as it introduces unnecessary complexity.

Performance Analyzer cannot collect metrics on Jetson Xavier

I have deployed Triton on Jetson Xavier and used Performance Analyzer to measure model performance during inference. Latency and throughput are correctly measured, but when I try to collect metrics using the --collect-metrics option, the following messages appear:

WARNING: Unable to parse 'nv_gpu_utilization' metric.
WARNING: Unable to parse 'nv_gpu_power_usage' metric.
WARNING: Unable to parse 'nv_gpu_memory_used_bytes' metric.
WARNING: Unable to parse 'nv_gpu_memory_total_bytes' metric.

The command I am using to launch the inferences is:
/usr/local/bin/perf_analyzer --collect-metrics -m 3D_fp32_05_batchd -b 1 --concurrency-range 1

Is it possible to solve this problem?

Thanks.

urllib dependency is present when using [grpc, cuda] options

Based on my reading of the code, urllib is only used for the HTTP client, but the top-level requirements.txt lists urllib as a dependency. It should only be in requirements_http.txt and not in the top-level requirements.txt.

This creates unnecessary conflicts when one only wants to depend on the gRPC client and not the HTTP one. Dependency conflicts like the one reported here: #648

Converting InferenceRequest to InferInput

I have a decoupled model (python backend) that receives requests from a client and sends the request to another downstream model, and the intermediate model only processes some inputs and passes the rest to the next model.

Currently, I'm converting the inputs to numpy arrays first and then wrapping them in InferInput.

for input_name in self.input_names[1:]:
    data_ = pb_utils.get_input_tensor_by_name(request, input_name) \
        .as_numpy() \
        .reshape(-1)
    input_ = triton_grpc.InferInput(
        input_name, data_.shape, "FP32" if data_.dtype == np.float32 else "INT32"
    )
    input_.set_data_from_numpy(data_)
    inputs.append(input_)

However, I think the .as_numpy() and .set_data_from_numpy() functions do some (de)serialization, and using a for loop to copy most of the inputs is a little bit inefficient.

Is there a way to convert InferenceRequest to InferInput more efficiently?

Thanks!
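One alternative worth sketching (this uses Triton's BLS API in the Python backend, a different mechanism than the gRPC client in the snippet above; the output name is hypothetical) is to pass the pb_utils.Tensor objects straight into a pb_utils.InferenceRequest, which avoids the as_numpy()/set_data_from_numpy() round trip:

import triton_python_backend_utils as pb_utils

def forward_to_downstream(request, input_names, downstream_model):
    # Reuse the incoming tensors directly instead of copying through numpy.
    tensors = [
        pb_utils.get_input_tensor_by_name(request, name) for name in input_names
    ]
    infer_request = pb_utils.InferenceRequest(
        model_name=downstream_model,
        requested_output_names=["OUTPUT0"],  # hypothetical output name
        inputs=tensors,
    )
    response = infer_request.exec()
    if response.has_error():
        raise pb_utils.TritonModelException(response.error().message())
    return response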

Delayed aio infer during burst requests

Background

I have a simple FastAPI app that calls the Triton server using the Triton client. It works great for small, non-rapid requests, but the behavior starts to decline when I introduce a lot of concurrent requests, such as when using JMeter. Originally I thought the bottleneck was on my Triton server, but it turns out it happens on the client (the FastAPI app). Both the Triton server and the client are on my local PC.

Reproduce attempt

Below are all components required to reproduce this problem.

# Repo structure
.
├── model_repository
│   └── dummy
│       ├── 1
│       │   └── model.py
│       └── config.pbtxt
<snip>
# config.pbtxt

backend: "python"
max_batch_size: 10

dynamic_batching {}

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [3, 1000, 1000]
  }
]

output [
  {
    name: "result"
    data_type: TYPE_INT32
    dims: [5]
  }
]
# model.py
import time

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        n = len(requests)
        print(f"received {n} requests, at {time.time()}", flush=True)
        # always return dummy tensor
        responses = []
        for i in range(n):
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("result", np.arange(5, dtype=np.int32))]
                )
            )
        return responses
# http_client_aio.py
import asyncio

import numpy as np
import tritonclient.http.aio as httpclient


async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)

    tasks = []
    for i in range(500):
        fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        fake_input.set_data_from_numpy(fake_image)
        task = asyncio.create_task(client.infer("dummy", [fake_input]))
        tasks.append(task)
    print("done sending")

    await asyncio.gather(*tasks)
    await client.close()


if __name__ == "__main__":
    asyncio.run(main())

Docker commands for Triton server.

docker run --shm-size=1gb --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models

I opened two terminals, ran the docker command in the first and http_client_aio.py in the second. Here is what happened:

  1. Start running http_client_aio.py.
  2. "done sending" is printed after ~2 secs.
  3. There are ~3 secs of delay during which nothing seems to happen.
  4. Logs in the Triton server (the TritonPythonModel print statements) start rolling in.

Questions

  1. Am I missing something?
  2. What happened on those ~3 secs delay?
  3. Can I saturate the Triton server using just the aio client with a relatively big inference payload (I use 1000x1000 image resolution)?
  4. What are the best practices to handle this bottleneck?

Incomplete installation of all genai-perf dependencies prevents it from being run on air-gapped servers

When genai-perf is installed using pip from GitHub (as documented), on first run it tries to download several files from Hugging Face, like this:

$ docker run --rm -it --name test -u 0 gpu-tritonserver-tst:latest bash -c "genai-perf --help"
tokenizer_config.json: 100%|████████████████████| 700/700 [00:00<00:00, 5.45MB/s]
tokenizer.model: 100%|████████████████████| 500k/500k [00:00<00:00, 78.0MB/s]
tokenizer.json: 100%|████████████████████| 1.84M/1.84M [00:00<00:00, 3.25MB/s]
special_tokens_map.json: 100%|████████████████████| 411/411 [00:00<00:00, 2.72MB/s]
usage: genai-perf [-h] [--expected-output-tokens EXPECTED_OUTPUT_TOKENS] [--input-type {url,file,synthetic}] [--input-tokens-mean INPUT_TOKENS_MEAN] [--input-tokens-stddev INPUT_TOKENS_STDDEV] -m MODEL
                  [--num-of-output-prompts NUM_OF_OUTPUT_PROMPTS] [--output-format {openai_chat_completions,openai_completions,trtllm,vllm}] [--random-seed RANDOM_SEED] [--concurrency CONCURRENCY]
                  [--input-data INPUT_DATA] [-p MEASUREMENT_INTERVAL] [--profile-export-file PROFILE_EXPORT_FILE] [--request-rate REQUEST_RATE] [--service-kind {triton,openai}] [-s STABILITY_PERCENTAGE]
                  [--streaming] [-v] [--version] [--endpoint ENDPOINT] [-u URL] [--dataset {openorca,cnn_dailymail}]

CLI to profile LLMs and Generative AI models with Perf Analyzer
[..]

This "calling-home" behavior will prevent genai-perf from running correctly in corporate air-gapped environments. All required dependencies need to be collected by Python or Bash scripts at install time (which can occur on a different sever, such as an internet-connected build server(s) where most pull actions are permitted) rather than being pulled upon the first run of the program.

How to write a CMakeLists.txt that uses grpcclient?

I added grpcclient_static to target_link_libraries in my CMakeLists.txt, but when compiling there are link errors about gRPC. Please tell me what is wrong, thank you!

genai-perf KeyError: 'service_kind'

environment

pip install "git+https://github.com/triton-inference-server/[email protected]#subdirectory=src/c++/perf_analyzer/genai-perf"
genai-perf==0.0.3

command

genai-perf \
  -m codex-py-mr \
  --service-kind triton \
  --backend tensorrtllm  \
  --tokenizer /mnt/models/source/tokenizer \
  --concurrency 1 \
  --measurement-interval 4000 \
  --artifact-dir artifacts \
  --profile-export-file my_profile_export.json \
  --url 0.0.0.0:9001 \
  --input-file /data/script/llm_inputs.json \
  --extra-inputs "temperature":0.1 \
  --extra-inputs "top_p":1.0 \
  --extra-inputs "top_k":50 \
  --extra-inputs "random_seed":809451

Run log

2024-06-13 03:51 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m codex-py-mr --async --input-data artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/llm_inputs.json --service-kind triton -u 0.0.0.0:9001 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/codex-py-mr-triton-tensorrtllm-concurrency1/my_profile_export.json -i grpc --streaming --shape max_tokens:1 --shape text_input:1 --concurrency-range 1'
WARNING: Pass contained only one request, so sample latency standard deviation will be infinity (UINT64_MAX).
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 144, in run
    data_parser = calculate_metrics(args, tokenizer)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 87, in calculate_metrics
    return LLMProfileDataParser(
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 533, in __init__
    super().__init__(filename)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 445, in __init__
    self._get_profile_metadata(data)
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/llm_metrics.py", line 449, in _get_profile_metadata
    self._service_kind = data["service_kind"]
KeyError: 'service_kind'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 154, in main
    run()
  File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 147, in run
    raise GenAIPerfException(e)
genai_perf.exceptions.GenAIPerfException: 'service_kind'
2024-06-13 03:51 [ERROR] genai_perf.main:158 - 'service_kind'

Failing with Generic Error message: Failed to obtain stable measurement.

I am testing with basic models. The model takes an input and returns the same output with the same datatype.

Inference is happening:
2024-08-20 09:35:15,923 - INFO - array_final: array([[103]], dtype=uint8)
array_final: [[103]]

perf_analyzer -m model_equals_b_uint8 --measurement-mode count_windows --measurement-request-count 5 -v
*** Measurement Settings ***
Batch size: 1
Service Kind: TRITON
Using "count_windows" mode for stabilization
Stabilizing using average latency and throughput
Minimum number of samples in each window: 5
Using synchronous calls for inference

Request concurrency: 1
Pass [1] throughput: 478.67 infer/sec. Avg latency: 2059 usec (std 3372 usec).
Pass [2] throughput: 621.713 infer/sec. Avg latency: 1625 usec (std 3008 usec).
Pass [3] throughput: 491.884 infer/sec. Avg latency: 2027 usec (std 13098 usec).
Pass [4] throughput: 18.0441 infer/sec. Avg latency: 54594 usec (std 80657 usec).
Pass [5] throughput: 16.6456 infer/sec. Avg latency: 61007 usec (std 68590 usec).
Pass [6] throughput: 62.8963 infer/sec. Avg latency: 5822 usec (std 7896 usec).
Pass [7] throughput: 16.871 infer/sec. Avg latency: 60256 usec (std 112842 usec).
Pass [8] throughput: 15.6989 infer/sec. Avg latency: 63212 usec (std 110034 usec).
Pass [9] throughput: 15.1902 infer/sec. Avg latency: 65797 usec (std 87972 usec).
Pass [10] throughput: 14.0266 infer/sec. Avg latency: 72140 usec (std 93986 usec).
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-request-count.
Failed to obtain stable measurement.

perf_analyzer -m model_equals_b_uint8 --measurement-mode count_windows --measurement-request-count 50 -v
*** Measurement Settings ***
Batch size: 1
Service Kind: TRITON
Using "count_windows" mode for stabilization
Stabilizing using average latency and throughput
Minimum number of samples in each window: 50
Using synchronous calls for inference

Request concurrency: 1
Pass [1] throughput: 23.4639 infer/sec. Avg latency: 42614 usec (std 182802 usec).
Pass [2] throughput: 141.78 infer/sec. Avg latency: 3437 usec (std 5377 usec).
Pass [3] throughput: 14.8405 infer/sec. Avg latency: 67552 usec (std 97666 usec).
Pass [4] throughput: 12.2003 infer/sec. Avg latency: 82423 usec (std 75027 usec).
Pass [5] throughput: 14.2399 infer/sec. Avg latency: 70712 usec (std 120651 usec).
Pass [6] throughput: 86.8397 infer/sec. Avg latency: 2083 usec (std 2502 usec).
Pass [7] throughput: 22.6803 infer/sec. Avg latency: 45020 usec (std 178493 usec).
Pass [8] throughput: 17.8704 infer/sec. Avg latency: 56233 usec (std 175833 usec).
Pass [9] throughput: 23.0646 infer/sec. Avg latency: 43166 usec (std 148978 usec).
Pass [10] throughput: 18.234 infer/sec. Avg latency: 55330 usec (std 102755 usec).
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-request-count.
Failed to obtain stable measurement.

perf_analyzer -m model_equals_b_uint8 --measurement-mode count_windows --measurement-request-count 75 -v
*** Measurement Settings ***
Batch size: 1
Service Kind: TRITON
Using "count_windows" mode for stabilization
Stabilizing using average latency and throughput
Minimum number of samples in each window: 75
Using synchronous calls for inference

Request concurrency: 1
Pass [1] throughput: 428.863 infer/sec. Avg latency: 2328 usec (std 3510 usec).
Pass [2] throughput: 494.642 infer/sec. Avg latency: 2018 usec (std 3441 usec).
Pass [3] throughput: 308.695 infer/sec. Avg latency: 3156 usec (std 13751 usec).
Pass [4] throughput: 340.429 infer/sec. Avg latency: 1828 usec (std 3966 usec).
Pass [5] throughput: 21.0775 infer/sec. Avg latency: 47814 usec (std 168738 usec).
Pass [6] throughput: 18.7684 infer/sec. Avg latency: 53730 usec (std 65595 usec).
Pass [7] throughput: 16.0608 infer/sec. Avg latency: 62265 usec (std 63152 usec).
Pass [8] throughput: 3.68812 infer/sec. Avg latency: 271139 usec (std 363750 usec).
Pass [9] throughput: 203.656 infer/sec. Avg latency: 4908 usec (std 6825 usec).
Pass [10] throughput: 214.693 infer/sec. Avg latency: 2469 usec (std 3830 usec).
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-request-count.
Failed to obtain stable measurement.

perf_analyzer -m model_equals_b_uint8 --measurement-mode count_windows --measurement-request-count 100 -v
*** Measurement Settings ***
Batch size: 1
Service Kind: TRITON
Using "count_windows" mode for stabilization
Stabilizing using average latency and throughput
Minimum number of samples in each window: 100
Using synchronous calls for inference

Request concurrency: 1
Pass [1] throughput: 423.137 infer/sec. Avg latency: 2331 usec (std 2866 usec).
Pass [2] throughput: 99.6489 infer/sec. Avg latency: 10037 usec (std 135019 usec).
Pass [3] throughput: 253.617 infer/sec. Avg latency: 1639 usec (std 1605 usec).
Pass [4] throughput: 16.316 infer/sec. Avg latency: 62273 usec (std 161047 usec).
Pass [5] throughput: 22.5236 infer/sec. Avg latency: 44084 usec (std 143282 usec).
Pass [6] throughput: 13.3747 infer/sec. Avg latency: 75319 usec (std 81540 usec).
Pass [7] throughput: 15.3824 infer/sec. Avg latency: 65006 usec (std 130209 usec).
Pass [8] throughput: 2.24593 infer/sec. Avg latency: 445246 usec (std 205477 usec).
Pass [9] throughput: 145.757 infer/sec. Avg latency: 2459 usec (std 4845 usec).
Pass [10] throughput: 15.9015 infer/sec. Avg latency: 63902 usec (std 89986 usec).
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-request-count.
Failed to obtain stable measurement.

perf_analyzer -b 1 -m model_equals_b_uint16 -v
*** Measurement Settings ***
Batch size: 1
Service Kind: TRITON
Using "time_windows" mode for stabilization
Stabilizing using average latency and throughput
Measurement window: 5000 msec
Using synchronous calls for inference

Request concurrency: 1
Pass [1] throughput: 264.819 infer/sec. Avg latency: 2428 usec (std 19166 usec).
Pass [2] throughput: 19.249 infer/sec. Avg latency: 45776 usec (std 59715 usec).
Pass [3] throughput: 11.4458 infer/sec. Avg latency: 87830 usec (std 55669 usec).
Pass [4] throughput: 13.7479 infer/sec. Avg latency: 73070 usec (std 177674 usec).
Pass [5] throughput: 16.5643 infer/sec. Avg latency: 59318 usec (std 166888 usec).
Pass [6] throughput: 11.5103 infer/sec. Avg latency: 86986 usec (std 188720 usec).
Pass [7] throughput: 32.5302 infer/sec. Avg latency: 31859 usec (std 184371 usec).
Pass [8] throughput: 23.3457 infer/sec. Avg latency: 42082 usec (std 186189 usec).
Pass [9] throughput: 14.2139 infer/sec. Avg latency: 70781 usec (std 194576 usec).
Pass [10] throughput: 14.5149 infer/sec. Avg latency: 68353 usec (std 190451 usec).
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-interval.
Failed to obtain stable measurement.

Every time I get the same generic error message, even though I am increasing --measurement-request-count.

Benchmarking VQA Model with Large Base64-Encoded Input Using perf_analyzer

Hello,

I've been deploying my VQA (Vision Query Answer) model using Triton Server and utilizing the perf_analyzer tool for benchmarking. However, using random data for the VQA model leads to undefined behavior, making it crucial to use real input data, which is challenging to construct. Below is the command I used with perf_analyzer:

perf_analyzer -m <model_name> --request-rate-range=10 --measurement-interval=30000 --string-data '{"imageBase64Str": "/9j/4AAQS...D//Z", "textPrompt": "\u8bf7\u5e2e\...\u3002"}'

The model expects a JSON-formatted string as input, with two fields: 'imageBase64Str', which contains base64-encoded image data, and 'textPrompt', which is the text input.

Fortunately, this method works. However, the request rate is disappointingly slow, averaging 500ms per request, even when I set --request-rate-range=10. I encountered the following warning:

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 100.00% of the requests were delayed.

I'm facing difficulties in benchmarking my model effectively, as it isn't receiving a sufficient number of requests at present. I suspect that the large size of the base64 data in the '--string-data' option is contributing to the slowdown. Is there a faster or better way to send requests that could help me achieve a more accurate benchmark?

Best regards,

Memory leak in SharedMemoryTensor.__dlpack__

Hello, a memory leak was detected when executing this code. The code was run on Python 3.10, tritonclient 2.41.1, torch 2.1.2.

import torch
import tritonclient.utils.cuda_shared_memory as cudashm

n1 = 1000
n2 = 1000
gpu_tensor = torch.ones([n1, n2]).cuda(0)
byte_size = 4 * n1 * n2
shm_handle = cudashm.create_shared_memory_region("cudashm_data", byte_size, 0)
while True:
    cudashm.set_shared_memory_region_from_dlpack(shm_handle, [gpu_tensor])
    smt = cudashm.as_shared_memory_tensor(shm_handle, "FP32", [n1, n2])
    generated_torch_tensor = torch.from_dlpack(smt)

The leak occurs when the __dlpack__ method is called in torch.from_dlpack(smt).

How to ensure `load_model` applies to the same server pod as `infer`?

In a k8s environment, there are multiple server replicas. We use the Python client, and on the server side we use explicit model control mode.

Now we do something like

service_url = "triton.{namespace}.svc.cluster.local:8000"
triton_client = InferenceServerClient(url=server_url)

if not triton_client.is_model_ready(model_name):
    triton_client.load_model(model_name)

triton_client.infer(
    model_name,
    model_version=model_version,
    inputs=triton_inputs,
    outputs=triton_outputs,
)

Since server_url is the k8s service endpoint, I guess it relies on the k8s load balancer to choose a random pod. In this case, how do we ensure that is_model_ready, load_model, and infer all apply to the same pod?

Any example of triton-vllm in c++?

I use tc::InferenceServerGrpcClient::Infer but get an error that models with a decoupled transaction policy are not supported.
When I use tc::InferenceServerGrpcClient::AsyncInfer, there is no error message but no request arrives at the server, and the server log shows the error: Infer failed: ModelInfer RPC doesn't support models with decoupled transaction policy
Which API should I use?

make cc-clients: Could not find requested file: RapidJSON-targets.cmake

cmake is not successful

 โฏ cmake --version
cmake version 3.21.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_EXAMPLES=ON ..
make cc-clients


...
[ 92%] Performing configure step for 'cc-clients'
loading initial cache file /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/tmp/cc-clients-cache-Release.cmake
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  /home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/_deps/repo-common-src/CMakeLists.txt:48 (find_package)


-- RapidJSON found. Headers:
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found Python: /usr/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- Found Protobuf: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/protobuf/bin/protoc-3.19.4.0 (found version "3.19.4.0")
-- Using protobuf 3.19.4.0
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.1.1f")
-- Found c-ares: /home/hayley/nvidia_trt_llm_backend/client/buid/third-party/c-ares/lib/cmake/c-ares/c-ares-config.cmake (found version "1.17.2")
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
-- Found Python: /home/hayley/llm_serving/.venv/bin/python3 (found version "3.8.10") found components: Interpreter
-- Using protobuf 3.19.4.0
-- Using protobuf 3.19.4.0
-- Found RE2 via CMake.
-- Using gRPC 1.48.0
CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:3 (include):
  include could not find requested file:

    /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSON-targets.cmake
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


CMake Error at /home/hayley/nvidia_trt_llm_backend/rapidjson/build/RapidJSONConfig.cmake:17 (get_target_property):
  get_target_property() called with non-existent target "RapidJSON".
Call Stack (most recent call first):
  library/CMakeLists.txt:49 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeOutput.log".
See also "/home/hayley/nvidia_trt_llm_backend/client/buid/cc-clients/CMakeFiles/CMakeError.log".
make[3]: *** [CMakeFiles/cc-clients.dir/build.make:96: cc-clients/src/cc-clients-stamp/cc-clients-configure] Error 1
make[2]: *** [CMakeFiles/Makefile2:119: CMakeFiles/cc-clients.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:126: CMakeFiles/cc-clients.dir/rule] Error 2
make: *** [Makefile:124: cc-clients] Error 2

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unable to open shared memory region: '/output_simple'

I ran the src/python/examples/simple_grpc_shm_client.py file, but got the error below:
Traceback (most recent call last):
File "demo.py", line 95, in
triton_client.register_system_shared_memory(
File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_client.py", line 1223, in register_system_shared_memory
raise_error_grpc(rpc_error)
File "/home/jiangzhao/anaconda3/envs/wav2lip192/lib/python3.8/site-packages/tritonclient/grpc/_utils.py", line 77, in raise_error_grpc
raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Unable to open shared memory region: '/output_simple'
How can I fix this?

Unable to use triton client with shared memory in C++ (Jetpack 6 device)

I am using the tritonserver + client iGPU release (tritonserver2.41.0-igpu.tar.gz) on a Jetpack 6 device. I want to use the shared memory functions with the Triton client, which are declared in shm_utils.h and defined in shm_utils.cc. However, the header is not found in the Triton client's include directory, leading to a compilation error.

After making the following changes to src/c++/library/CMakeLists.txt and building the client from source, I was able to include the header and use the shared memory functions. (Triton client branch: r23.12)

@@ -84,12 +84,12 @@ if(TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER)
   # libgrpcclient object build
   set(
       REQUEST_SRCS
-      grpc_client.cc common.cc
+      grpc_client.cc common.cc shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      grpc_client.h common.h ipc.h
+      grpc_client.h common.h ipc.h shm_utils.h
   )
 
   add_library(
@@ -257,12 +257,12 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_PERF_ANALYZER)
   # libhttpclient object build
   set(
       REQUEST_SRCS
-      http_client.cc common.cc cencode.c
+      http_client.cc common.cc cencode.c shm_utils.cc
   )
 
   set(
       REQUEST_HDRS
-      http_client.h common.h ipc.h cencode.h
+      http_client.h common.h ipc.h cencode.h shm_utils.h
   )
 
   add_library(
@@ -394,6 +394,7 @@ if(TRITON_ENABLE_CC_HTTP OR TRITON_ENABLE_CC_GRPC OR TRITON_ENABLE_PERF_ANALYZER
       FILES
       ${CMAKE_CURRENT_SOURCE_DIR}/common.h
       ${CMAKE_CURRENT_SOURCE_DIR}/ipc.h
+      ${CMAKE_CURRENT_SOURCE_DIR}/shm_utils.h
       DESTINATION include
   )
 

Are the shared memory cc and header files not included by default, or am I not including them correctly during compilation?

Add support for FetchContent or find_package

Neither find_package() nor FetchContent works out of the box for a standalone C++ CMake app.

find_package

Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder, or alternatively install tritonclient globally. In either case, the following minimal CMake configuration fails:

# tritonclient
find_package(TritonCommon REQUIRED)
find_package(TritonClient REQUIRED)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:43 (add_executable):
  Target "test" links to target "protobuf::libprotobuf" but the target was
  not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

FetchContent

# tritonclient
FetchContent_Declare(
    tritonclient
    GIT_REPOSITORY https://github.com/triton-inference-server/client
    GIT_TAG r24.04
)
set(TRITON_ENABLE_CC_GRPC ON)
set(TRITON_COMMON_REPO_TAG r24.04)
set(TRITON_THIRD_PARTY_REPO_TAG r24.04)
set(TRITON_CORE_REPO_TAG r24.04)
FetchContent_MakeAvailable(tritonclient)

...
target_link_libraries(test PRIVATE TritonClient::grpcclient rt m dl)
CMake Error at CMakeLists.txt:55 (add_executable):
  Target "test" links to target "TritonClient::grpcclient" but the target
  was not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?

What is the currently recommended way to create a standalone app with a tritonclient dependency? It seems that for find_package the link targets are broken, and for FetchContent there is no support at all because no targets are exported.

When using the Java Triton client, why must INT8-type input tensors be passed using int_contents? This causes the inference results from passing INT8 embedding tensors to the model to never align with the Python client.

List<Integer> inputEmbeddingData = new ArrayList<>();
byte[] tempArr = new byte[embeddings.size()];
for (int i = 0; i < tempArr.length; i++) {
    tempArr[i] = embeddings.get(i);
}
ByteBuffer byteBuffer = ByteBuffer.wrap(tempArr);
byteBuffer.order(ByteOrder.LITTLE_ENDIAN);
while (byteBuffer.hasRemaining()) {
    inputEmbeddingData.add(byteBuffer.getInt());
}

GrpcService.InferTensorContents.Builder inputEmbeddingContentsBuilder = GrpcService.InferTensorContents.newBuilder();
inputEmbeddingData.forEach(inputEmbeddingContentsBuilder::addIntContents);
GrpcService.ModelInferRequest.InferInputTensor inputEmbeddingTensor = GrpcService.ModelInferRequest.InferInputTensor.newBuilder()
        .setName(INPUT_TENSOR_INPUT_EMBEDDINGS)
        .setDatatype("INT8")
        .addShape(1L)
        .addShape(embeddings.size())
        .setContents(inputEmbeddingContentsBuilder.build())
        .build();
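For comparison, the Python gRPC client sends an INT8 tensor as raw little-endian bytes (raw_input_contents on the wire) with one byte per element, rather than packing four bytes into each int_contents entry as the Java snippet above does; that difference is a likely source of the mismatch. A hedged Python sketch with a hypothetical tensor name:

import numpy as np
import tritonclient.grpc as grpcclient

embeddings = np.array([1, -2, 3, 4], dtype=np.int8)  # hypothetical data

# set_data_from_numpy() serializes this as raw bytes, one byte per INT8 element.
infer_input = grpcclient.InferInput("input_embeddings", [1, embeddings.size], "INT8")
infer_input.set_data_from_numpy(embeddings.reshape(1, -1))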
