triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.

Home Page: https://triton-inference-server.github.io/pytriton/

License: Apache License 2.0

Languages: Python 94.07%, Shell 5.56%, Makefile 0.36%
Topics: gpu, inference, deep-learning

pytriton's Introduction

PyTriton

Welcome to PyTriton, a Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving Machine Learning models with ease, supporting direct deployment from Python.

For comprehensive guidance on how to deploy your models, optimize performance, and explore the API, delve into the extensive resources found in our documentation.

Features at a Glance

The distinct capabilities of PyTriton are summarized in the feature matrix:

  • Native Python support: You can create any Python function and expose it as an HTTP/gRPC API.
  • Framework-agnostic: You can run any Python code with any framework of your choice, such as PyTorch, TensorFlow, or JAX.
  • Performance optimization: You can benefit from dynamic batching, response cache, model pipelining, clusters, performance tracing, and GPU/CPU inference.
  • Decorators: You can use batching decorators to handle batching and other pre-processing tasks for your inference function.
  • Easy installation and setup: You can use a simple and familiar Flask/FastAPI-like interface for easy installation and setup.
  • Model clients: You can access high-level model clients for HTTP/gRPC requests with configurable options and both synchronous and asynchronous APIs.
  • Streaming (alpha): You can stream partial responses from a model by serving it in decoupled mode.

Learn more about PyTriton's architecture.

Prerequisites

Before proceeding with the installation of PyTriton, ensure your system meets the following criteria:

  • Operating System: Compatible with glibc version 2.35 or higher.
    • Primarily tested on Ubuntu 22.04.
    • Other supported OS include Debian 11+, Rocky Linux 9+, and Red Hat UBI 9+.
    • Use ldd --version to verify your glibc version.
  • Python: Version 3.8 or newer.
  • pip: Version 20.3 or newer.
  • libpython: Ensure libpython3.*.so is installed, corresponding to your Python version.
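
A quick way to sanity-check the requirements above from the interpreter you plan to use (a minimal sketch, not part of the official docs; platform.libc_ver() reports the glibc version on glibc-based systems, and pip's version is read from package metadata):

import importlib.metadata
import platform
import sys

libc_name, libc_version = platform.libc_ver()
print(f"Python: {sys.version_info.major}.{sys.version_info.minor} (need 3.8+)")
print(f"libc  : {libc_name} {libc_version} (need glibc 2.35+)")
print(f"pip   : {importlib.metadata.version('pip')} (need 20.3+)")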

Install

PyTriton can be installed from pypi.org by running the following command:

pip install nvidia-pytriton

Important: The Triton Inference Server binary is installed as part of the PyTriton package.

Discover more about PyTriton's installation procedures, including Docker usage, prerequisites, and insights into building binaries from source to match your specific Triton server versions.

Quick Start

The quick start shows how to run a Python model in Triton Inference Server without needing to change the current working environment. The example uses a simple Linear model.

The infer_fn is a function that takes a data tensor and returns a list with a single output tensor. The @batch decorator from batching decorators is used to handle batching for the model.

import numpy as np
from pytriton.decorators import batch

@batch
def infer_fn(data):
    result = data * np.array([[-1]], dtype=np.float32)  # Process inputs and produce result
    return [result]

In the next step, you can create the binding between the inference callable and Triton Inference Server using the bind method from PyTriton. This method takes the model name, the inference callable, the input and output tensors, and an optional model configuration object.

from pytriton.model_config import Tensor
from pytriton.triton import Triton
triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,)),],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,)),],
)
triton.run()

Finally, you can send an inference query to the model using the ModelClient class. The infer_sample method takes the input data as a numpy array and returns the output data as a numpy array. You can learn more about the ModelClient class in the clients section.

from pytriton.client import ModelClient

client = ModelClient("localhost", "Linear")
data = np.array([1, 2, ], dtype=np.float32)
print(client.infer_sample(data=data))

After the inference is done, you can stop the Triton Inference Server and close the client:

client.close()
triton.stop()

The output of the inference should be:

{'result': array([-1., -2.], dtype=float32)}

For the full example, including defining the model and binding it to the Triton server, check out our detailed Quick Start instructions. Get your model up and running, explore how to serve it, and learn how to invoke it from client applications.

The full example code can be found in examples/linear_random_pytorch.

Examples

The examples page presents various cases of serving models using PyTriton. You can find simple examples of running PyTorch, TensorFlow2, JAX, and simple Python models. Additionally, we have prepared more advanced scenarios like online learning, multi-node models, or deployment on Kubernetes using PyTriton. Each example contains instructions describing how to build and run the example. Learn more about how to use PyTriton by reviewing our examples.


pytriton's People

Contributors

catwell, getty708, jkosek, mahimairaja, matthewkotila, nv-blazejkubiak, peganovanton, piotrm-nvidia, pziecina-nv, szalpal, yeahdongcn


pytriton's Issues

Can I use pytriton to deploy models to a remote server?

While I find the PyTriton library convenient, my plan is not to run the server from a Python process on the same machine as the backend system. Instead, I plan on running Triton on a separate machine and want to know whether it is possible to deploy models to it using PyTriton. Can you provide guidance on how to achieve this?
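
For context (hedged, based on the Quick Start above): PyTriton starts the Triton server inside the Python script that defines the inference callable, so the models live wherever that script runs; other machines then reach them as ordinary Triton HTTP/gRPC endpoints. A minimal client-side sketch, where the hostname is a placeholder and the URL form with scheme and port is an assumption to check against the clients documentation:

# Runs on any machine that can reach the serving host; "inference-host" is a placeholder.
import numpy as np
from pytriton.client import ModelClient

with ModelClient("http://inference-host:8000", "Linear") as client:
    result = client.infer_sample(data=np.array([1.0, 2.0], dtype=np.float32))
    print(result)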

Sagemaker example

Hi there, is there any worked example for running pytriton on SageMaker? I see the TritonConfig class has parameters like allow_sagemaker, sagemaker_port, and such; do you have any (basic) instructions to share on how to use them, please?
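
Not an official example, but a minimal sketch built only from the option names mentioned in the question (allow_sagemaker, sagemaker_port); the values are placeholders and should be checked against the TritonConfig reference:

from pytriton.triton import Triton, TritonConfig

# Option names taken from the question above; values are placeholders.
config = TritonConfig(allow_sagemaker=True, sagemaker_port=8080)

with Triton(config=config) as triton:
    # triton.bind(...) as in the Quick Start section above, then:
    triton.serve()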

Does pytriton support hot loading models?

I haven't seen any docs/API for dynamically loading/unloading models after the Triton server has started.
Does pytriton support this feature, or has it not been developed yet?

How to implement a machine learning model that receives requests as JSON from the client and processes them

The request has the form { "id": "123", "rows": [ [ ... ], [ ... ] ], "columns": [ ... ] }
Code:

import numpy as np
import pandas as pd
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

def undecorated_identity_fn(**inputs):
    id = inputs["id"]
    rows = inputs["rows"]
    columns = inputs["columns"]
    logger.debug(f"id data: {id} {rows} {columns}")
    raw_df = pd.DataFrame(rows, columns=columns)
    raw_df = raw_df[["feature" + str(i) for i in range(1, 17)]]
    prediction = model_1.predict(raw_df)
    is_drifted = 0
    return [{
            "id": id,
            "predictions": prediction.tolist(),
            "drift": is_drifted,
        }]

with Triton() as triton:
    logger.info("Loading Linear model.")
    triton.bind(
        model_name="prob1",
        infer_func=undecorated_identity_fn,
        inputs=[
            Tensor(name="id", dtype=np.uint8, shape=(1,)),
            Tensor(name="rows",dtype=object, shape=(1,)),
            Tensor(name="columns", dtype=object, shape=(1,)),
        ],
        outputs=[
            Tensor(name="id", dtype=np.uint8, shape=(1,)),
            Tensor(name="predictions", dtype=object, shape=(1,)),
            Tensor(name="drift", dtype=np.uint, shape=(1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    logger.info("Serving models")
    triton.serve()

Log error: {"error":"Unable to parse 'inputs': attempt to access non-existing object member 'inputs'"}
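
The error message says the HTTP body lacks a top-level "inputs" member, i.e. the raw {id, rows, columns} document is not in the KServe v2 format Triton expects. A hedged sketch of wrapping the fields that way, mirroring the requests-based client shown further down this page; the dtypes, shapes, and the idea of serializing rows/columns as JSON strings are illustrative assumptions only, must match the bind() declaration above, and would require the callable to decode and json.loads those strings:

import json
import requests

raw = {"id": "123", "rows": [[0.1] * 16], "columns": [f"feature{i}" for i in range(1, 17)]}

payload = {
    "id": "0",
    "inputs": [
        # Shapes are [batch, 1] because each input was declared with shape=(1,);
        # the id value is sent as an integer to match the declared UINT8 dtype.
        {"name": "id", "shape": [1, 1], "datatype": "UINT8", "data": [123]},
        {"name": "rows", "shape": [1, 1], "datatype": "BYTES", "data": [json.dumps(raw["rows"])]},
        {"name": "columns", "shape": [1, 1], "datatype": "BYTES", "data": [json.dumps(raw["columns"])]},
    ],
}
response = requests.post("http://localhost:8000/v2/models/prob1/infer", json=payload).json()
print(response)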

Are there any examples of using python multiprocessing to run multiple copies of model on same GPU

Hi, thanks for the library, this is much easier than regular triton.

Right now I'm trying to run multiple instances of the same model (concurrently) on one big GPU. Unfortunately the concurrency is not working, seems like a GIL issue. Seems I have to use python multiprocessing library to do it myself. Do you have any examples of doing this? I looked through the existing examples but didn't find anything.

Failed to start the server

Hi, I am trying the example in the README, but the server fails to start with the following output:

(cellpose) weiouyang@europa:~/workspace/pytriton$ python test-pytriton.py 
Triton Inference Server exited with failure. Please wait.
Internal proxy backend error. It will be closed.
Context was terminated
Traceback (most recent call last):
  File "/home/weiouyang/workspace/pytriton/pytriton/models/model.py", line 222, in _model_proxy_handshake
    socket.recv()
  File "zmq/backend/cython/socket.pyx", line 803, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 839, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 193, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 188, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ContextTerminated: Context was terminated
Traceback (most recent call last):
  File "test-pytriton.py", line 34, in <module>
    triton.serve()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 343, in serve
    self.run()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 318, in run
    self._wait_for_models()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 429, in _wait_for_models
    client.wait_for_model(timeout_s=120)
  File "/home/weiouyang/workspace/pytriton/pytriton/client/client.py", line 154, in wait_for_model
    wait_for_model_ready(self._client, self._model_name, self._model_version, timeout_s=timeout_s)
  File "/home/weiouyang/workspace/pytriton/pytriton/client/utils.py", line 224, in wait_for_model_ready
    is_model_ready = condition.wait_for(_is_model_ready, timeout=min(1.0, timeout_s))
  File "/home/weiouyang/miniconda3/envs/cellpose/lib/python3.8/threading.py", line 328, in wait_for
    result = predicate()
  File "/home/weiouyang/workspace/pytriton/pytriton/client/utils.py", line 207, in _is_model_ready
    raise PytritonClientModelUnavailableError(f"Model {model_name}/{model_version_msg} is unavailable.")
pytriton.client.exceptions.PytritonClientModelUnavailableError: Model Linear/1 is unavailable.

I installed pytriton by building from source (on the main branch, commit hash: bc71f5840e909f58beaba5ecd3c4ea624e242523).

Also a minor suggestion to improve the README is to add from pytriton.decorators import batch when using the batch decorator.

pytriton is slower than triton

My model is a BERT classification model. Triton inference is 17 ms per request; pytriton is 34 ms per request. Could someone help me?

Below is the pytriton server code:

import os

import numpy as np
import torch
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


label_list = ["finance", "realty", "stocks", "education", "science", "society", "politics", "sports", "game",
                  "entertainment", ]

pretrained_bert_dir = "/models"
tokenizer = BertTokenizer.from_pretrained(pretrained_bert_dir)
bert_config = BertConfig.from_pretrained(pretrained_bert_dir, num_labels=len(label_list))

model = BertForSequenceClassification(bert_config)
model.load_state_dict(torch.load(os.path.join(pretrained_bert_dir, "bert_model.pth")))
model.to("cuda")
model.eval()


@batch
def _infer_fn(sentence: np.ndarray):
    print(sentence)
    sequences_batch = np.char.decode(sentence.astype("bytes"), "utf-8")
    print(sequences_batch)

    labels = []
    for s in sequences_batch:
        inputs = tokenizer(s[0], max_length=32, truncation="longest_first", return_tensors="pt")
        inputs = inputs.to("cuda")

        outputs = model(**inputs)
        print(outputs)

        logits = outputs[0]
        print(logits)
        label_pro = torch.max(logits.data, 1)[1].tolist()
        print(label_pro)
        label = label_list[label_pro[0]]
        print(label)
        labels.append(label)
    return {"label": np.char.encode(labels, "utf-8")}

with Triton() as triton:
    triton.bind(
        model_name="BERT",
        infer_func=_infer_fn,
        inputs=[
            Tensor(name="sentence", dtype=np.bytes_, shape=(1,)),
        ],
        outputs=[
            Tensor(name="label", dtype=np.bytes_, shape=(1,)),
        ],
        config=ModelConfig(max_batch_size=10)
    )
    triton.serve()

Below is the pytriton client code:

import requests

sequence = ["明天要考试了"]
d = {
  "id": "0",
  "inputs": [
    {
      "name": "sentence",
      "shape": [1,1],
      "datatype": "BYTES",
      "data": sequence
    }
  ]
}
res = requests.post(url="http://114.116.17.243:31800/v2/models/BERT/infer",json=d).json()
r = res["outputs"][0]["data"][0]
print(r)
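
One hedged observation (an assumption, not a measured diagnosis): the server callable above tokenizes and runs the model one sentence at a time, so Triton's dynamic batching cannot amortize anything. A sketch of handling the whole batch in one tokenizer/model call, reusing the tokenizer, model, and label_list defined in the server code above:

import numpy as np
import torch
from pytriton.decorators import batch


@batch
def _infer_fn(sentence: np.ndarray):
    # Decode the whole batch of byte strings at once.
    sequences = [s[0] for s in np.char.decode(sentence.astype("bytes"), "utf-8")]

    # Tokenize and run the model on the full batch instead of looping per sentence.
    inputs = tokenizer(sequences, max_length=32, truncation=True, padding=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits

    labels = [label_list[i] for i in logits.argmax(dim=-1).tolist()]
    return {"label": np.char.encode(np.array(labels), "utf-8")}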

Enabling Redis cache throws: Unable to find shared library libtritonserver.so

Description

Running the docker image (below) with redis cache settings enabled via cache_config causes the triton server to fail on launch.

When I comment out both the cache_config and cache_directory options, the server starts and runs successfully but does not utilize the redis cache.

When I uncomment the cache_config and cache_directory options, the server fails to find the libtritonserver.so shared library file.

To reproduce

FROM nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3

WORKDIR /app

COPY . .

RUN pip install poetry

# Build https://github.com/triton-inference-server/redis_cache

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y python3 python3-distutils python-is-python3 git \
    build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev curl \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
    openssh-client cmake rapidjson-dev


RUN git clone https://github.com/triton-inference-server/redis_cache.git && \
    cd redis_cache && \
    ./build.sh && \
    mkdir -p /opt/tritonserver/caches/redis && \
    cp /app/redis_cache/build/libtritoncache_redis.so /opt/tritonserver/caches/redis/libtritoncache_redis.so

RUN cd triton-server && \
    poetry config virtualenvs.create false && \
    poetry install 


# HTTP, gRPC, & metrics traffic respectively
EXPOSE 8000 \
       8001 \
       8002

Fails with unable to find shared library libtritonserver.so

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ....
    return model, tokenizer


@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]



def main(model_hidden_dim: int = 768):
    
    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()


if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)

Does not fail, but does not utilize Redis

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    # cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    # cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ....
    return model, tokenizer


@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]



def main(model_hidden_dim: int = 768):
    
    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()


if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)

Observed results and expected behavior

Please describe the observed results as well as the expected results.
If possible, attach relevant log output to help analyze your problem.
If an error is raised, please paste the full traceback of the exception.


Environment

  • OS/container version:
    • container: nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3
    • OS: macOS Sonoma 14.1.1 (23B81)
    • docker build --platform=linux/arm64 -t test-image .
  • Python interpreter distribution and version: python==3.10.12
  • pip version: 23.3
  • PyTriton version: 0.1.4
  • Deployment details: CPU only docker image

Additional context
Add any other context about the problem here.

unable to install

Description

A clear and concise description of the bug.

To reproduce

If relevant, add a minimal example so that we can reproduce the error, if necessary, by running the code. For example:

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def _infer_fn(**inputs):
    ...
    results_dict = model(**inputs)  # ex note: the bug is here, we expect to receive ...
    ...
    # note: observing results_dict as dictionary of numpy arrays
    return results_dict


with Triton() as triton:
    triton.bind(
        model_name="MyModel",
        infer_func=_infer_fn,
        inputs=[
            Tensor(name="in1", dtype=np.float32, shape=(-1,)),
            Tensor(name="in2", dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="out1", dtype=np.float32, shape=(-1,)),
            Tensor(name="out2", dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()
# client
import numpy as np
from pytriton.client import ModelClient

batch_size = 2
in1_batch = np.ones((batch_size, 1), dtype=np.float32)
in2_batch = np.ones((batch_size, 1), dtype=np.float32)

with ModelClient("localhost", "MyModel") as client:
    result_batch = client.infer_batch(in1_batch, in2_batch)

Observed results and expected behavior

Please describe the observed results as well as the expected results.
If possible, attach relevant log output to help analyze your problem.
If an error is raised, please paste the full traceback of the exception.


Environment

  • OS/container version: [e.g., container nvcr.io/nvidia/pytorch:23.02-py3 / virtual machine with Ubuntu 22.04]
  • Python interpreter distribution and version: [e.g., CPython 3.8 / conda 4.7.12 with Python 3.8 environment]
  • PyTriton version: [e.g., 0.1.4 / custom build from source at commit ______]
  • Deployment details: [e.g., multi-node multi-GPU setup on GKE / multi-GPU single-node setup in Jupyter Notebook]

Additional context
Add any other context about the problem here.
ERROR: Could not find a version that satisfies the requirement nvidia-pytriton

Error deploying model on Vertex AI

Description

Hi! I'm trying to deploy a StableDiffusion model in GCP Vertex AI using Pytriton backend. My code works on a local machine, and I've been able to send requests and receive inference responses.

My problem arises when I try to create an endpoint using Vertex AI. The server run fails with this error:

WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait.

And then:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
...
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")

I don't know whether the Vertex AI service error is due to the server crashing first, or vice versa.

To reproduce

Attaching my server code

# server
import argparse
import logging
import os
import time

import numpy as np
import torch
from model import ModelWrapper
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig
from urllib.parse import urlparse

class _InferFuncWrapper:
    """
    Class wrapper of inference func for triton. Used to also store the model variable
    """

    def __init__(self, model: torch.nn.Module):
        self._model = model

    @batch
    def __call__(self, **inputs) -> np.ndarray:
        """
        Main inference function for triton backend. Called after batch inference.
        Performs all the logic of decoding inputs, calling the model and returning
        outputs.

        Args:
            prompts: Batch of strings with the user prompts
            init_images: Batch of initial image to run the diffusion

        Returns
            image: Batch of generated images
        """
        (prompts, init_images) = inputs.values()
        # decode prompts and images
        prompts = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in prompts]
        init_images = [
            np.char.decode(enc_img.astype("bytes"), "utf-8").item()
            for enc_img in init_images
        ]
        init_images = [_decode_img(enc_img) for enc_img in init_images]
        # transform image arrays to tensors and adjust dims to torch usage
        images_tensors = torch.tensor(init_images, dtype=torch.float32).permute(
            0, 3, 1, 2
        )
        LOGGER.debug(f"Prompts: {prompts}")
        LOGGER.debug(f"{len(init_images)} images size: {init_images[0].shape}")
        LOGGER.info("Generating images...")
        # call diffusion model
        outputs = self._model.run(prompts, images_tensors)
        LOGGER.debug(f"Prepared batch response of size: {len(outputs)}")
        return {"image": np.array(outputs)}


def _parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Enable verbose logging in debug mode.",
        default=True,
    )

    parser.add_argument(
        "--vertex",
        "-s",
        action="store_true",
        help="Enable copying model files from storage for vertex deployment",
        default=False,
    )

    return parser.parse_args()

def main():
    """Initialize server with model."""
    args = _parse_args()

    # initialize logging
    log_level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    if args.vertex:
        LOGGER.debug("Vertex: Loading pipeline from Vertex Storage")
        storage_path = os.environ["AIP_STORAGE_URI"]
    else:
        LOGGER.debug("Loading pipeline locally")
        storage_path = ("") # Path to local files
    
    bucket_name, subdirectory = parse_path(storage_path)
    LOGGER.debug(f"Downloading files... Started at: {time.strftime('%X')}")
    download_blob(bucket_name, subdirectory)
    LOGGER.debug(f"Files downloaded! Finished at: {time.strftime('%X')}")
    folder_path = os.path.join("src", subdirectory)

    LOGGER.debug(f"Running on device: {DEVICE}, dtype: {DTYPE}, triton_port:{PORT}")
    LOGGER.info("Loading pipeline...")
    model = ModelWrapper(logger=LOGGER, folder_path=folder_path)
    LOGGER.info("Pipeline loaded!")

    log_verbose = 1 if args.verbose else 0

    config = TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose)

    with Triton(config=config) as triton:
        # bind the model with its inference call and configuration
        triton.bind(
            model_name="StableDiffusion_Img2Img",
            infer_func=_InferFuncWrapper(model=model),
            inputs=[
                Tensor(name="prompt", dtype=np.bytes_, shape=(1,)),
                Tensor(name="init_image", dtype=np.bytes_, shape=(1,)),
            ],
            outputs=[
                Tensor(name="image", dtype=np.bytes_, shape=(1,)),
            ],
            config=ModelConfig(
                max_batch_size=4,
                batcher=DynamicBatcher(
                    max_queue_delay_microseconds=100,
                ),
            ),
            strict=True,
        )
        # serve the model for inference
        triton.serve()


if __name__ == "__main__":
    main()

When creating Vertex endpoint, server predict route is configured to:
/v2/models/StableDiffusion_Img2Img/infer

And server health route is configured to:
/v2/health/live

With Vertex port=8015, same as HTTP port set in model configuration.

Observed results and expected behavior

As stated, the server runs on a local machine but fails when initializing the endpoint in Vertex AI.
During the Vertex build, local files are correctly downloaded and the model pipeline is loaded, so the error is probably in the triton.bind() function.
Attaching complete log output:

DEBUG - StableDiffusion_Img2Img.server: Files downloaded! Finished at: 18:28:01
DEBUG - StableDiffusion_Img2Img.server: Running on device: cuda, dtype: torch.float16, triton_port:8015
INFO - StableDiffusion_Img2Img.server: Loading pipeline..
INFO - StableDiffusion_Img2Img.server: Pipeline loaded!

...
2023-11-23 18:29:10,322 - DEBUG - pytriton.triton: Triton Inference Server binaries ready in /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver
2023-11-23 18:29:10,322 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
 2023-11-23 18:29:10,323 - DEBUG - pytriton.utils.distribution: Obtained nvidia_pytriton.libs path: /usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.triton: Starting Triton Inference
2023-11-23 18:29:10,324 - DEBUG - pytriton.server.triton_server: Triton Server binary /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver/bin/tritonserver. Environment:
{
...
}
2023-11-23 18:29:10,449 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99996042251587)
2023-11-23 18:29:12,954 - WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait
2023-11-23 18:29:12,954 - DEBUG - pytriton.server.triton_server: Triton Inference Server exit code 1
2023-11-23 18:29:12,954 - DEBUG - pytriton.triton: Got callback that tritonserver process finished
2023-11-23 15:31:10.655 Traceback (most recent call last):
2023-11-23 15:31:10.655 File "/home/app/src/server.py", line 200, in <module>
2023-11-23 18:31:10,655 - DEBUG - pytriton.triton: Cleaning model manager, tensor store and workspace.
failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
2023-11-23 18:31:10,655 - DEBUG - pytriton.utils.workspace: Cleaning workspace dir /root/.cache/pytriton/workspace_y7vpgv3x
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")
pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

Additional steps taken

From the timeout error raised by pytriton, we've tried increasing the timeout by setting monitoring_period_s in server.run() to an arbitrarily high threshold.

We've also tried adapting the server configuration to Vertex with:

TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose, allow_vertex_ai=True, vertex_ai_port=8080)

But we get the same error.
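
One more hedged avenue, based on the error text above: the tritonserver binary has a --vertex-ai-default-model flag for exactly this "single model / default model" complaint. Assuming TritonConfig mirrors it as vertex_ai_default_model (an assumption to verify against the TritonConfig reference), the configuration might look like:

from pytriton.triton import TritonConfig

# 'vertex_ai_default_model' is assumed to mirror tritonserver's
# --vertex-ai-default-model flag; the other options come from the attempts above.
config = TritonConfig(
    http_port=8015,
    exit_on_error=True,
    allow_vertex_ai=True,
    vertex_ai_port=8080,
    vertex_ai_default_model="StableDiffusion_Img2Img",
)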

Environment

Docker base image: nvcr.io/nvidia/pytorch:23.10-py3
Requirements:

torch @ https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
diffusers==0.7.2
transformers==4.21.3
ftfy==6.1.1
importlib-metadata==4.13.0
nvidia-pytriton==0.4.1
Pillow==9.5
google-cloud-storage==2.10.0

Any help is appreciated!!

Dynamic Batching Seems Invalid?

Hi

I am trying to combine FastAPI with pytriton to serve a MobileNet_V2 TensorRT plan model. The inference service has dynamic batching enabled and response caching disabled. I used Python's locust tool for load testing, manually printed the tensor shapes, and checked the metrics reported by the Triton server itself. It seems that dynamic batching does not work. My example code is as follows; is there any problem with it?

Code Snippet

# input shape: nx3x112x112, output shape: nx1000
import numpy as np
import tensorrt as trt
import torch
from fastapi import FastAPI, Request
from pytriton.client import ModelClient
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton

with open("/mnt/data/codes/learning/python/pytriton/mobile_v2.plan", "rb") as f:
    runtime = trt.Runtime(trt.Logger(trt.Logger.ERROR))
    engine  = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context() 


@batch
def infer_trt(**inputs: np.ndarray):
    (input1_batch,)           = inputs.values()
    input1_batch_tensor  = torch.from_numpy(input1_batch).to("cuda")

    print("input1_batch_tensor.shape = ", input1_batch_tensor.shape)

    context.set_binding_shape(0, input1_batch_tensor.shape)
    shape                   = context.get_binding_shape(1)
    output1_batch_tensor    = torch.empty(size=tuple(shape), dtype=torch.float32, device="cuda")

    bindings                = [None] * 2
    bindings[0]             = input1_batch_tensor.contiguous().data_ptr()
    bindings[1]             = output1_batch_tensor.contiguous().data_ptr()
    
    context.execute_async_v2(bindings, torch.cuda.current_stream().cuda_stream)
    output1_batch        = output1_batch_tensor.cpu().detach().numpy()
    return [output1_batch]

class ModelTrt(object):
    def __init__(self) -> None:
        self._run = False
        self.triton = Triton(config = triton_config)
        self.triton.bind(
            model_name = "MobileNet_v2",                  
            infer_func = infer_trt,
            inputs=[
                Tensor(dtype=np.float32, shape=(3, 112, 112)),    
            ],
            outputs=[
                Tensor(dtype=np.float32, shape=(1000, )),
            ],
            config = ModelConfig(batching = True,
                        max_batch_size = 16, 
                        batcher = DynamicBatcher(
                            max_queue_delay_microseconds = 50000,
                            preferred_batch_size = [8, 16],
                            preserve_ordering = True
                        ),
                        response_cache = False)
        )

     
    def run(self):
        if not self._run:
            self.triton.run()     
            self._run = True
        logger.info("Serving models")



    def __del__(self):
        if self._run:
            self.triton.stop()
        logger.info("Serving models stop")

server = ModelTrt()

server.run()

app   = FastAPI()

@app.post("/infer")
async def infer_image(request : Request):
    input1_batch = torch.randn(2, 3, 112, 112).cpu().detach().numpy() 

    with ModelClient("localhost", "MobileNet_v2") as client:
        logger.info("Sending request")
        result_dict = client.infer_batch(input1_batch)

    for output_name, output_batch in result_dict.items():
        logger.info(f"{output_name}: {output_batch[0].shape}")
    return {output_name: output_batch[0].shape}

TensorRT Plan Model

trtexec --onnx=mobile_v2.onnx --saveEngine=mobile_v2.plan  \
        --minShapes=input:1x3x112x112  \
        --optShapes=input:1x3x112x112 \
        --maxShapes=input:16x3x112x112 

Pressure Test

# locust_perf.py
from locust import task, FastHttpUser

class WebService(FastHttpUser):
    host = "http://localhost:31010"

    @task
    def root(self):
        with self.client.request(method = "POST", url="/infer") as resp:
            assert  resp.status_code == 200
locust -f locust_perf.py --headless -u 10 -r 1

Shape Info in Callable

[screenshot: input tensor shapes printed in the callable]

Metrics Info from Triton Server

curl localhost:8002/metrics > metrics.txt

nv_inference_request_success{model="MobileNet_v2",version="1"} 1078
nv_inference_request_failure{model="MobileNet_v2",version="1"} 0
nv_inference_count{model="MobileNet_v2",version="1"} 2156
nv_inference_exec_count{model="MobileNet_v2",version="1"} 1078

Libs Version Info

nvidia-pytriton         0.2.0
tritonclient            2.34.0

How to infer with sequences?

It seems that ModelClient does not support sequence inference: "sequence_start", "sequence_id", and "sequence_end" cannot be found in infer_sample/infer_batch.

Streaming and batching

Hi,
I'm new to pytriton.
I am trying to deploy a model and run inference with a text-generation pipeline, and I managed to get streaming to work as per the example.
I would like to know how I can scale the deployment and whether batching is compatible with streaming.

Thank you in advance

While running inference via server.py and client.py, why does the client take GPU memory?

Hello, I am new to Triton and trying to understand its behaviour. I am facing one point of confusion, which is given below:

Here I am running two client requests against one server.py. Why do the two client.py processes consume GPU memory (showing up as two GPU processes) when the model is running in the server?


Here 967 MiB is consumed by the server.py script and 105 MiB by the client.py files.

When multiple client requests come in, does Triton create multiple instances of the same model, or does it run them on a single instance?

Best practices with ModelClient

This is really just a question around using ModelClient in a situation where you expect to send a large number of inference requests over time.

My use-case is a python service that listens to a RabbitMQ queue for inference requests and then sends a Triton inference call for each message in the queue. Therefore, the service that is using ModelClient will be running continuously. My question is whether I should use the context manager with ModelClient() as client... for each inference request or if I should instantiate one client object at start-up and keep it open until the service is shut down. I didn't see any examples that didn't use the context manager so I wasn't sure.
Thank you
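
Not an official recommendation, but a minimal sketch of the long-lived variant described in the question, using only the ModelClient calls shown elsewhere on this page; the model name is a placeholder and the message source stands in for the RabbitMQ loop:

import numpy as np
from pytriton.client import ModelClient


def consume_messages():
    # Placeholder for the RabbitMQ consume loop from the question.
    yield np.array([1.0, 2.0], dtype=np.float32)


client = ModelClient("localhost", "Linear")  # created once at service start-up
try:
    for data in consume_messages():
        result = client.infer_sample(data=data)
        print(result)
finally:
    client.close()  # released once when the service shuts down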

AttributeError: '_thread.RLock' object has no attribute '_recursion_count'

Description

Bug when upgrading to Python 3.11.7.
3.11.5 is fine.

Log

  File "/tmp/folderZ0Klgm/1/model.py", line 210, in execute
    meta_requests = self._put_requests_to_buffer(triton_requests)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/folderZ0Klgm/1/model.py", line 305, in _put_requests_to_buffer
    tensor_ids = self._tensor_store.put([tensor for *_, tensor in input_arrays_with_coords])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/folderZ0Klgm/1/communication.py", line 688, in put
    shm = multiprocessing.shared_memory.SharedMemory(block.shm_name, create=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/shared_memory.py", line 120, in __init__
    resource_tracker.register(self._name, "shared_memory")
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 174, in register
    self._send('REGISTER', name, rtype)
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 182, in _send
    self.ensure_running()
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 100, in ensure_running
    if self._lock._recursion_count() > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_thread.RLock' object has no attribute '_recursion_count'

Environment

  • OS/container version: Ubuntu 22.04 (latest apt update)
  • Python interpreter distribution and version: [Python 3.11.7 from miniconda]
  • pip version: 23.3.1
  • PyTriton version: 0.4.2

Example of (or support for) Inference Callable of Triton ensemble definition

Is your feature request related to a problem? Please describe.
I currently have an automated config generator for creating proper ensemble pbtxt configs over a DAG of backends. I currently do not see a natural way of defining and/or referencing Triton ensembles with the PyTriton module.

Describe the solution you'd like
I would like to either:
(a) create the ensemble definition directly with the PyTriton module, or
(b) load all ensemble models and the associated (already generated via above method) ensemble and model configs

Describe alternatives you've considered
Our current solution mentioned in the feature request does the job, but does not utilize PyTriton's friendly interface.

Additional context
N/A

May input tensor of the Callable be GPU addressed Tensor?

Hi

As shown in the title, can GPU tensors (such as a torch.Tensor on device) be inputs to the Callable? For a complex application with multiple backends connected in series, a numpy.ndarray-only scheme results in multiple H2D and D2H operations and wastes a lot of time.

TensorRT-LLM support?

Is your feature request related to a problem? Please describe.

I can't seem to find any examples of how to deploy models that are built for TensorRT-LLM. Is this possible, and am I missing the documentation for it?

Describe the solution you'd like

Either improve documentation on how to utilize pytriton with tensorrt-llm or explain why such a combination is non-desirable or ill-formed

Describe alternatives you've considered

I've looked at the OPT-Jax example and have begun experimenting with using a Jax port of LLaMA 2 with that example.

Load python backend models when pytriton starts

Is your feature request related to a problem? Please describe.
We would like to load an existing Triton model repository with a few python_backend models when we start pytriton. We would like to do this in Colab.

Describe alternatives you've considered
We could potentially run Docker to spin up a Triton server, but this is in Colab, where we cannot run Docker.

@ctr26

how to use instance group

Hi

I have a question about instance_group.

In TIS, I'm using instance_group in config.pbtxt.

instance_group [{
count: 3,
kind: KIND_GPU
}]

I want to use an instance_group count of 3 on a single GPU in pytriton.

Is there a way to use instance_group as above in pytriton?
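
There is no config.pbtxt to edit, but a hedged sketch follows, assuming infer_func accepts a sequence of callables with each callable served as one model instance (an assumption to verify against the binding documentation); model and tensor names are placeholders:

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    return [data * np.array([[-1]], dtype=np.float32)]


with Triton() as triton:
    triton.bind(
        model_name="Linear",
        # Assumption: a sequence of callables maps to multiple model instances,
        # roughly like instance_group { count: 3 } in config.pbtxt.
        infer_func=[infer_fn, infer_fn, infer_fn],
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()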

some questions

1. What is the difference between infer_sample and infer_batch in the source code? I don't see a difference between them: infer_sample will also try to batch the input if max_batch_size > 1, which should be done by infer_batch.
2. In server.py, how should we return the results? The docs don't elaborate on this.
For example, we have a result as below:

imgs=np.ones((8,1920,1080,3),dtype=np.uint8)

Which way should I use to return this value: return [imgs] or return dict(result=imgs)? (See the sketch at the end of this issue.)
I define the triton_config as follows:

with Triton(config=triton_config) as triton:
        logger.info("Loading ResNet model.")
        triton.bind(
            model_name="ResNet",
            infer_func=resnet_factory(DEVICES),
            inputs=[
                Tensor(name="image", dtype=np.uint8, shape=(-1, -1, 3)),
            ],
            outputs=[
                Tensor(name="result", dtype=np.uint8, shape=(-1,)),
            ],
            config=ModelConfig(
                max_batch_size=32,
                batcher=DynamicBatcher(max_queue_delay_microseconds=5000),  # 5ms
            ),
        )
        logger.info("Serving inference")
        triton.serve()

Another question:
When result key's shape is defined as

shape=(-1,)
shape=(-1,-1,-1,3)
shape=(-1,-1,-1,-1,3)

the server can always successfully return the results, which is very confusing. In the above situation, shouldn't the result key's shape only be (-1,-1,-1,3)?
3. I noticed you added a CLI option inference_timeout_s, which raises a timeout exception on the client side, but the server side keeps processing and can't receive new requests. I recommend a more proper way that raises a timeout exception in a thread on the server side, so that the server stops processing immediately and can accept new requests; please refer to https://github.com/dr-luke/PyTimeoutAfter
4. What is the speed of sending requests from the client to the server? I decode the images on the client and then send them to the server; I find this quite slow. The transport between client and server is really time consuming.

Sorry for so many questions. Thanks very much!
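
On question 2, both return conventions already appear on this page: the Quick Start callable returns a list ordered like the declared outputs, while the BERT example returns a dict keyed by output name. A minimal sketch of the two side by side (hedged, not an authoritative statement of the API contract; the shapes are placeholders):

import numpy as np
from pytriton.decorators import batch


@batch
def infer_as_list(image):
    imgs = np.zeros((image.shape[0], 8), dtype=np.uint8)
    return [imgs]  # positional: order must match the declared outputs=[...]


@batch
def infer_as_dict(image):
    imgs = np.zeros((image.shape[0], 8), dtype=np.uint8)
    return {"result": imgs}  # named: keys must match Tensor(name=...) in outputs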

Support for ubuntu20.04

Is your feature request related to a problem? Please describe.
Right now pytriton must be used on Ubuntu 22.04, which is really strict since most people are using a version < 22.04. I know this is because Triton Inference Server is built on 22.04. I wonder if an LTS version sticking to 20.04 could be published.

Describe the solution you'd like
An OS-compatible version of pytriton. If people need new features from Triton Inference Server, they can use pytriton built on 22.04; otherwise they can use pytriton built on 20.04, which includes bug fixes and features that are specific to pytriton rather than Triton Inference Server.

Describe alternatives you've considered
If we cannot do this officially, can I build pytriton from source?

[problem] How to allow multiple models to run on the same GPU at the same time?

I'm using pytriton in my project, but I'm facing some problems.

My core server-side code is as follows:

import numpy as np
import torch
from torch.cuda import Event, synchronize
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton

start = Event(enable_timing=True)
end = Event(enable_timing=True)

@batch
def Densenet169(tensor,num = -1):
    # logging.info(f'num:{num}')
    tensor = torch.from_numpy(tensor).cuda(cuda_num)
    start.record()
    result = model2(tensor,num)
    end.record()
    synchronize()
    time = start.elapsed_time(end)
    return {'result' : result.cpu().detach().numpy(),
            'time': np.full([1,1],time,dtype=np.float32)}

batcher = DynamicBatcher(preferred_batch_size=[32])

with Triton() as triton:
    triton.bind(
        model_name="Densenet169",
        infer_func=Densenet169,
        inputs=[
            Tensor(name = 'tensor',dtype=np.float32, shape=(-1,-1,-1,)),
            Tensor(name = 'num',dtype=np.int32, shape=(1,)),
        ],
        outputs=[
            Tensor(name = 'result',dtype=np.float32, shape=(-1,)),
            Tensor(name = 'time',dtype=np.float32, shape=(1,))
        ],
        config=ModelConfig(batching=True,max_batch_size=128,batcher=batcher),
        # config=ModelConfig(batching=False),
        strict=False,
    )
    triton.serve()

As you can see, there are two inputs and two outputs in my server settings: one output, 'result', is the result of the input 'tensor', and the other is the CUDA running time during inference.

The 'result' is not that important, but 'time' is what I need. Triton does have metrics, but only a rough number, not as precise as I need. I want the time each inference costs (or several of them, but in fact Triton's metrics are enough for me).

After I modified my infer_func, things went unexpectedly:

  1. I need to maximize the GPU utilization rate, so I tried to increase the batch size by sending requests more frequently from more clients, but pytriton merges HTTP requests into one request, which measures the running time of more than one request and returns it as a single running time to the merged requests. This is not what I expected.
  2. So I tried to send more than one sample in one batch, which is prohibited by pytriton, since an input can only have the shape of a single sample. I have not tried setting 'strict' to False in this case, but I doubt it would work, so I haven't tried it yet.
  3. Then I found another package called tritonclient, with which I successfully pass more than one sample in one batch using the following code:
        request = tritonhttpclient.InferInput('tensor', y.shape, 'FP32')
        request_num = tritonhttpclient.InferInput('num', [batchsize, 1], 'INT32')
        request.set_data_from_numpy(y)
        request_num.set_data_from_numpy(num)

        response = triton_client.infer(model_name, inputs=[request, request_num], model_version='1')
  4. Overall, I did increase the GPU utilization rate. However, a few problems still bother me:

    a) How can I turn off the automatic merging of HTTP requests without setting 'batching' to False? As I mentioned in 2, pytriton merges HTTP requests into one request; is there any way to turn that off? I tried to use 'DynamicBatcher' as in the code above, and while it works, it still feels awkward and unnatural. Could there be a more elegant way?

    b) How can I allow multiple models to run on the same GPU at the same time? The GPU utilization rate is not high enough; I want requests to run immediately when the server receives them, so that multiple models can run on the same GPU at the same time. What should I do? (See the sketch after this list.)

    c) Extra problem: is there any method to achieve my purpose in an elegant way? I want to maximize the GPU utilization rate as much as possible so that the models influence each other, for scientific purposes.
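
On (b), a hedged sketch (not an official answer): several models can be bound to the same Triton instance, so requests to different models can be in flight on the same GPU. The model names, callables, and shapes below are placeholders:

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def model_a(tensor):
    return {"result": tensor.astype(np.float32)}


@batch
def model_b(tensor):
    return {"result": (tensor * 2).astype(np.float32)}


with Triton() as triton:
    # Two independent models bound to one Triton instance; both are served
    # from the same process and share the same GPU.
    for name, fn in (("ModelA", model_a), ("ModelB", model_b)):
        triton.bind(
            model_name=name,
            infer_func=fn,
            inputs=[Tensor(name="tensor", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=16),
        )
    triton.serve()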

Binary output truncated...

Description

When a large array is converted to npz format and returned, the client receives a truncated result.

To reproduce

If relevant, add a minimal example so that we can reproduce the error, if necessary, by running the code. For example:

# server
import io
import h5py
import numpy as np

from typing import Dict

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


class TritonInferModel:

    def __init__(self) -> None:
        self._model_name = 'TestImpl'

        # Encoder
        self._enc_ip = (Tensor(name="input_string", shape=(-1,), dtype=bytes),
                        Tensor(name="format", shape=(1, ), dtype=bytes),)

        self._enc_op = (Tensor(name="embedding", shape=(1, ), dtype=bytes),)


    def _to_npz(self, **kwargs):
        with io.BytesIO() as output:
            np.savez(output, **kwargs)
            return output.getvalue()

    def _to_h5(self, **kwargs):
        with io.BytesIO() as output:
            with h5py.File(output, 'w') as h5f:
                for label, data in kwargs.items():
                    h5f.create_dataset(label, data=data)
            return output.getvalue()

    @batch
    def encoder(self, **inputs: np.ndarray) -> Dict[str, np.ndarray]:
        embeddings={'embeddings': np.random.rand(1, 1, 512)}

        format = np.char.decode(inputs.pop("format").astype("bytes"), encoding="utf-8")
        format = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in format][0]
        if format == 'npz':
            output = self._to_npz(**embeddings)
        elif format == 'h5':
            output = self._to_h5(**embeddings)
        else:
            raise Exception(f'Unknown format {format}')

        print("=================> size", len(output), np.array([[output]], dtype=bytes).shape)

        return {"embedding": np.array([[output]], dtype=bytes), }

    def start(self, triton):
        print(f"Loading {self._model_name} encoder...")

        triton.bind(
            model_name=f"embeddings",
            infer_func=self.encoder,
            inputs=self._enc_ip,
            outputs=self._enc_op,
            config=ModelConfig(max_batch_size=8),
            strict=True,
        )

def main():
    with Triton() as triton:
        inferer = TritonInferModel()
        inferer.start(triton)
        triton.serve()

if __name__ == "__main__":
    main()
# client
import io
import h5py
import numpy as np
import tritonclient.http as httpclient

from tritonclient.utils import *


encoder_model = "embeddings"

test_input = ['Input_1', 'Input_2', 'Input_3']

def test_encoder_n_decoder():
    with httpclient.InferenceServerClient("localhost:8000") as client:

        format = 'npz'
        smis_input = np.array([test_input]).astype(bytes)
        format_input = np.array([[format]]).astype(bytes)
        inputs = [
            httpclient.InferInput("input_string", smis_input.shape,
                                np_to_triton_dtype(smis_input.dtype)),
            httpclient.InferInput("format", format_input.shape,
                                np_to_triton_dtype(format_input.dtype)),
        ]

        inputs[0].set_data_from_numpy(smis_input)
        inputs[1].set_data_from_numpy(format_input)

        outputs = [
            httpclient.InferRequestedOutput("embedding"),
        ]

        response = client.infer(encoder_model,
                                inputs,
                                request_id=str(1),
                                outputs=outputs)

        result = response.get_response()
        embeddings = response.as_numpy("embedding")

        print("=" * 80)
        print(result)
        print("SMILES              : ", smis_input)
        print("Raw Content length  : ", len(embeddings[0][0]))

        embeddings = io.BytesIO(embeddings[0][0])
        if format == 'h5':
            conv_embeddings = h5py.File(embeddings)
            conv_embeddings = conv_embeddings['embeddings']
        elif format == 'npz':
            conv_embeddings = np.load(embeddings)['embeddings']

        print("Embedding           : ", conv_embeddings.shape, type(conv_embeddings))
        print("=" * 80)


if __name__ == '__main__':
    test_encoder_n_decoder()

Observed results and expected behavior

Expected behavior:

  • Server returns an array with 4370 elements
  • Client receives an array with 4370 elements

Actual behavior:

  • Server returns an array with 4370 elements
  • Client receives an array with 4366 elements

Environment

  • PyTriton version: 0.2.5
  • Deployment details: Single GPU A6000

Issues inferencing HTTP with Bart model

Hey guys, I just ran across pytriton, looks incredible! It solves a lot of the headaches of scaling with FastAPI...

Excuse my ignorance and being a novice to triton.

I am having a bit of an issue trying to pass in an array for this summarizer model via HTTP POST.
Here's my example trying to load the Bart model; I keep trying to figure out how to pass in shapes, e.g. -1, 1024.

I can see the summarizer work in the responses, but the output is just showing generic responses.

Any tips would be very helpful!

import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton
from pytriton.decorators import batch
import numpy as np


tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

@batch
def _infer_fn(prompt: np.ndarray):
    prompts = np.char.decode(prompt.astype("bytes"), "utf-8")
    prompts = prompts.tolist()

    summaries = []
    for prompt in prompts:
        # Preprocess the prompt for summarization
        encoded_prompt = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True)
        input_ids = encoded_prompt.to(device)

        # Generate the summary
        summary_ids = model.generate(input_ids, num_beams=4, max_length=300, early_stopping=False)
        summary = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
        summaries.append(summary)

    return {"summary": np.char.encode(summaries, "utf-8")}

def main():
    with Triton() as triton:
        triton.bind(
            model_name="text_summarization",
            infer_func=_infer_fn,
            inputs=[
                Tensor(name="prompt", dtype=bytes, shape=(-1,)),
            ],
            outputs=[
                Tensor(name="summary", dtype=bytes, shape=(-1,)),
            ],
            config=ModelConfig(max_batch_size=8),
        )
        triton.serve()

if __name__ == "__main__":
    main()

Passing in via postman:

{
"inputs": [
{
"name": "prompt",
"shape": [3, -1], // Tried many different ways...
"datatype": "BYTES",
"data": ["prompt goes here", "prompt goes here","prompt goes here"]
}
]
}
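
Not a verified fix, but a hedged alternative to hand-building the HTTP payload: the ModelClient shown in the Quick Start derives shapes from the numpy array itself. The prompts and host below are placeholders, and the (3, 1) shape (three samples of one string each) is an assumption that must match the bind() above:

import numpy as np
from pytriton.client import ModelClient

prompts = np.array([["prompt goes here"], ["another prompt"], ["a third prompt"]])
prompts = np.char.encode(prompts, "utf-8")  # dtype bytes, shape (3, 1)

with ModelClient("localhost", "text_summarization") as client:
    result = client.infer_batch(prompt=prompts)
    print(result["summary"])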

PyTriton doesn't natively support PyTorch or TensorFlow dtypes

I was trying to use PyTorch with PyTriton,

nvidia-pytriton           0.2.5                    pypi_0    pypi

I need to pass a tensor to the server, but Python raises the following error:

TypeError: 'torch.dtype' object is not callable

I checked the source code and here is the reason.
In file pytriton/model_config/generator.py:

            if spec_.dtype in [np.object_, object, bytes, np.bytes_]:
                dtype = "TYPE_STRING"
            else:
                # pytype: disable=attribute-error
                dtype = spec_.dtype().dtype
                # pytype: enable=attribute-error 
                dtype = f"TYPE_{client_utils.np_to_triton_dtype(dtype)}"

which indicates that only a limited set of datatypes is supported.
Is there a more convenient way to pass a torch tensor to the server than manually converting it into bytes?
I wish pytriton could support more datatypes in a future version.
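
Until such support exists, the pattern used elsewhere on this page is to declare inputs/outputs with numpy dtypes and convert at the boundary of the callable; a minimal sketch (the model call is a placeholder):

import numpy as np
import torch
from pytriton.decorators import batch


@batch
def infer_fn(data: np.ndarray):
    tensor = torch.from_numpy(data).cuda()   # numpy -> torch on the GPU
    output = tensor * -1                     # placeholder for the real model call
    return {"result": output.cpu().numpy()}  # torch -> numpy for the response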

Support for aarch64

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
I saw in known_issues.md that only x86-64 is currently supported. Is there any official plan to support aarch64 (e.g. AGX Orin) in the near future?
Even if it is not officially supported, is this feasible by changing a few things?

Describe alternatives you've considered
None

Additional context
No

Thank you and keep up the good work :)

Document pros and cons

Is your feature request related to a problem? Please describe.
Lack of documentation on pros and cons of using PyTriton VS Triton backends directly.

Describe the solution you'd like
First, thanks for this library! It fills a gap between writing usual Python code and using Triton.

I would like to find the answers to the following questions in the documentation:

  1. Latency: what is the latency overhead of executing an inference function in PyTriton VS implementing a TritonPythonModel with the same code in the Python backend?
  2. Concurrency: is parallel execution of inference functions limited by the Python GIL? If yes, how can this be overcome?
  3. Local model or backend: what are the pros and cons of, e.g., loading a TensorFlow SavedModel and using it in an inference function VS serving the same model with the Triton TensorFlow backend?
  4. Supported backends: how to add more backends to the built-in Triton, for instance the TensorFlow backend?
  5. Invoke other models: how can an inference function execute inference requests on other models served by Triton, similarly to BLS?
  6. Collocation with other services: is it advised to run other servers (e.g. a JSON-RPC server frontend) or threads (e.g. to reload a data file) in the same "server.py" process?
  7. External services: is it advised to call external APIs (e.g. to retrieve additional features) from an inference function?

Describe alternatives you've considered
Publishing a rationale for why this project was created, along with a roadmap, would answer some of these questions.
Having a forum where such questions can be asked would help too.

Additional context
Thank you in advance!
Feel free to ask if more information is needed.

TypeError: a bytes-like object is required, not 'tuple'

I am receiving the following error when handling multiple requests.

Failed to process the request(s) for model instance 'MODEL_0', message: TritonModelException: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/inference_handler.py", line 124, in run
    output_tensor_infos = self.shm_response_manager.to_shm(
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 259, in to_shm
    return [
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 261, in <listcomp>
    {
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 262, in <dictcomp>
    input_name: self._wrap_array(req_res_index, input_name, req_resp.data[input_name])
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 310, in _wrap_array
    buf[:required_buffer_size] = serialized_np_array
TypeError: a bytes-like object is required, not 'tuple'

Should this be managed by the --max-batch-size param?
Or am I missing another concurrency-related param when deploying?

This is the Dockerfile:

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.04-py3
ARG BUILD_FROM

FROM ${FROM_IMAGE_NAME} as base
WORKDIR /opt/app

FROM base as install-from-pypi
RUN pip install -U nvidia-pytriton

FROM install-from-pypi AS image

ENV PYTHONUNBUFFERED=1
WORKDIR /opt/app

COPY install.sh /opt/app
RUN /opt/app/install.sh

COPY client.py /opt/app
COPY server.py /opt/app

ENTRYPOINT ["python","server.py"]

Support for DALI and TensorRT

Hi:

Are DALI and TensorRT currently supported? If so, could corresponding examples be provided in the examples directory? Thanks a lot.

How to pass priority level during inference?

I am confused about how to pass the priority value when performing inference.

For example, if I set up the DynamicBatcher with two priority levels:

batcher = DynamicBatcher(
    max_queue_delay_microseconds=1000,
    preserve_ordering=False,
    priority_levels=2,
    default_priority_level=2,
)

When using the client and calling infer_batch or infer_sample, where is the priority supposed to be passed? Looking at the docs, I assumed headers is where you would pass it, so I tried this:

client.infer_batch(..., headers={'priority': 1})

However, that does not work. I couldn't find any examples or more detailed docs anywhere on how priority is supposed to be used. Any help would be appreciated.
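
Not an authoritative answer, but Triton's inference protocol treats priority as a request parameter rather than an HTTP header, so the parameters argument of infer_batch/infer_sample looks like the more likely place; a sketch of that guess, assuming a single-input model and a local server (the address and model name are placeholders):

import numpy as np
from pytriton.client import ModelClient

data = np.zeros((2, 3), dtype=np.float32)  # placeholder batch for a hypothetical single-input model

with ModelClient("localhost:8000", "my_model") as client:
    # Guess: send priority as a Triton request parameter instead of an HTTP header.
    result = client.infer_batch(data, parameters={"priority": 1})

In Triton, lower numbers mean higher priority, so with priority_levels=2 a value of 1 would be the high-priority lane.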

Support Mac installation

It would be great to be able to install pytriton on Macs for ease-of-development. Even with the lack of CUDA support for Macs, being able to develop using only the CPU would be a real time saver.

As a far reaching stretch goal, being able to use mps on Macs would also be great.

As of now, pip install nvidia-pytriton throws an error because it is unable to install cuda-python on my machine.

My Machine:

  • Apple M2 Max
  • 32GB Memory
  • Year - 2023
  • Python - 3.11.6

Stub process 'REGIS_0' is not healthy

Description

Running this server in Docker, it will exit after a while with the error above.

To reproduce

It happens only occasionally and cannot be reproduced 100% of the time.

Environment

  • Python interpreter distribution and version: [conda 4.7.12 with Python 3.8 environment]
  • pip version: [23.1.2]
  • PyTriton version: [0.2.5]
  • Deployment details: [single gpu A10]

Additional context
After the first time, I worked around this by adding "ipc=host" to the docker-compose file, but it still happened occasionally afterwards. Can anyone help me out? Thanks!

Is batch decorator deprecated?

Description

After updating to 0.2.0, an infer_func class with the @batch decorator raises an error:

[error screenshot]

But it works fine without the decorator.
