triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.

Home Page: https://triton-inference-server.github.io/pytriton/

License: Apache License 2.0

Languages: Python 94.07%, Shell 5.56%, Makefile 0.36%
Topics: gpu, inference, deep-learning

pytriton's Introduction

PyTriton

Welcome to PyTriton, a Flask/FastAPI-like framework designed to streamline the use of NVIDIA's Triton Inference Server within Python environments. PyTriton enables serving Machine Learning models with ease, supporting direct deployment from Python.

For comprehensive guidance on how to deploy your models, optimize performance, and explore the API, delve into the extensive resources found in our documentation.

Features at a Glance

The distinct capabilities of PyTriton are summarized in the feature matrix:

  • Native Python support: You can create any Python function and expose it as an HTTP/gRPC API.
  • Framework-agnostic: You can run any Python code with any framework of your choice, such as PyTorch, TensorFlow, or JAX.
  • Performance optimization: You can benefit from dynamic batching, response cache, model pipelining, clusters, performance tracing, and GPU/CPU inference.
  • Decorators: You can use batching decorators to handle batching and other pre-processing tasks for your inference function.
  • Easy installation and setup: You can use a simple and familiar Flask/FastAPI-like interface for easy installation and setup.
  • Model clients: You can access high-level model clients for HTTP/gRPC requests with configurable options and both synchronous and asynchronous APIs.
  • Streaming (alpha): You can stream partial responses from a model by serving it in decoupled mode.

Learn more about PyTriton's architecture.

Prerequisites

Before proceeding with the installation of PyTriton, ensure your system meets the following criteria:

  • Operating System: Compatible with glibc version 2.35 or higher.
    • Primarily tested on Ubuntu 22.04.
    • Other supported OS include Debian 11+, Rocky Linux 9+, and Red Hat UBI 9+.
    • Use ldd --version to verify your glibc version.
  • Python: Version 3.8 or newer.
  • pip: Version 20.3 or newer.
  • libpython: Ensure libpython3.*.so is installed, corresponding to your Python version.
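
A quick way to sanity-check the requirements above from the interpreter you plan to use (a minimal sketch, not part of the official docs; platform.libc_ver() reports the glibc version on glibc-based systems, and pip's version is read from package metadata):

import importlib.metadata
import platform
import sys

libc_name, libc_version = platform.libc_ver()
print(f"Python: {sys.version_info.major}.{sys.version_info.minor} (need 3.8+)")
print(f"libc  : {libc_name} {libc_version} (need glibc 2.35+)")
print(f"pip   : {importlib.metadata.version('pip')} (need 20.3+)")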

Install

PyTriton can be installed from pypi.org by running the following command:

pip install nvidia-pytriton

Important: The Triton Inference Server binary is installed as part of the PyTriton package.

Discover more about PyTriton's installation procedures, including Docker usage, prerequisites, and insights into building binaries from source to match your specific Triton server versions.

Quick Start

The quick start shows how to run a Python model in Triton Inference Server without needing to change the current working environment. The example uses a simple Linear model.

The infer_fn is a function that takes a data tensor and returns a list with a single output tensor. The @batch decorator from batching decorators is used to handle batching for the model.

import numpy as np
from pytriton.decorators import batch

@batch
def infer_fn(data):
    result = data * np.array([[-1]], dtype=np.float32)  # Process inputs and produce result
    return [result]

In the next step, you can create the binding between the inference callable and Triton Inference Server using the bind method from PyTriton. This method takes the model name, the inference callable, the input and output tensors, and an optional model configuration object.

from pytriton.model_config import Tensor
from pytriton.triton import Triton
triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,)),],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,)),],
)
triton.run()

Finally, you can send an inference query to the model using the ModelClient class. The infer_sample method takes the input data as a numpy array and returns the output data as a numpy array. You can learn more about the ModelClient class in the clients section.

from pytriton.client import ModelClient

client = ModelClient("localhost", "Linear")
data = np.array([1, 2, ], dtype=np.float32)
print(client.infer_sample(data=data))

After the inference is done, you can stop the Triton Inference Server and close the client:

client.close()
triton.stop()

The output of the inference should be:

{'result': array([-1., -2.], dtype=float32)}

For the full example, including defining the model and binding it to the Triton server, check out our detailed Quick Start instructions. Get your model up and running, explore how to serve it, and learn how to invoke it from client applications.

The full example code can be found in examples/linear_random_pytorch.

Examples

The examples page presents various cases of serving models using PyTriton. You can find simple examples of running PyTorch, TensorFlow2, JAX, and simple Python models. Additionally, we have prepared more advanced scenarios like online learning, multi-node models, or deployment on Kubernetes using PyTriton. Each example contains instructions describing how to build and run the example. Learn more about how to use PyTriton by reviewing our examples.


pytriton's People

Contributors

catwell, getty708, jkosek, mahimairaja, matthewkotila, nv-blazejkubiak, peganovanton, piotrm-nvidia, pziecina-nv, szalpal, yeahdongcn


pytriton's Issues

Can I use pytriton to deploy models to a remote server?

While I find the PyTriton library convenient, my plan is not to run the server from a Python process on the same machine as the backend system. Instead, I plan on running Triton on a separate machine and want to know whether it is possible to deploy models to it using PyTriton. Can you provide guidance on how to achieve this?
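
For context (hedged, based on the Quick Start above): PyTriton starts the Triton server inside the Python script that defines the inference callable, so the models live wherever that script runs; other machines then reach them as ordinary Triton HTTP/gRPC endpoints. A minimal client-side sketch, where the hostname is a placeholder and the URL form with scheme and port is an assumption to check against the clients documentation:

# Runs on any machine that can reach the serving host; "inference-host" is a placeholder.
import numpy as np
from pytriton.client import ModelClient

with ModelClient("http://inference-host:8000", "Linear") as client:
    result = client.infer_sample(data=np.array([1.0, 2.0], dtype=np.float32))
    print(result)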

Sagemaker example

Hi there, is there any worked example for running pytriton on SageMaker? I see the TritonConfig class has parameters like allow_sagemaker, sagemaker_port, and such; do you have any (basic) instructions to share on how to use them, please?
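
Not an official example, but a minimal sketch built only from the option names mentioned in the question (allow_sagemaker, sagemaker_port); the values are placeholders and should be checked against the TritonConfig reference:

from pytriton.triton import Triton, TritonConfig

# Option names taken from the question above; values are placeholders.
config = TritonConfig(allow_sagemaker=True, sagemaker_port=8080)

with Triton(config=config) as triton:
    # triton.bind(...) as in the Quick Start section above, then:
    triton.serve()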

Does pytriton support hot loading models?

I haven't seen any docs/API for dynamically loading/unloading models after the Triton server has started.
Does pytriton support this feature, or has it not been developed yet?

How to implement a machine learning model that receives requests as JSON from the client and processes them

The request has the form { "id": "123", "rows": [ [ ... ], [ ... ] ], "columns": [ ... ] }
Code:

import numpy as np
import pandas as pd
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

def undecorated_identity_fn(**inputs):
    id = inputs["id"]
    rows = inputs["rows"]
    columns = inputs["columns"]
    logger.debug(f"id data: {id} {rows} {columns}")
    raw_df = pd.DataFrame(rows, columns=columns)
    raw_df = raw_df[["feature" + str(i) for i in range(1, 17)]]
    prediction = model_1.predict(raw_df)
    is_drifted = 0
    return [{
            "id": id,
            "predictions": prediction.tolist(),
            "drift": is_drifted,
        }]

with Triton() as triton:
    logger.info("Loading Linear model.")
    triton.bind(
        model_name="prob1",
        infer_func=undecorated_identity_fn,
        inputs=[
            Tensor(name="id", dtype=np.uint8, shape=(1,)),
            Tensor(name="rows",dtype=object, shape=(1,)),
            Tensor(name="columns", dtype=object, shape=(1,)),
        ],
        outputs=[
            Tensor(name="id", dtype=np.uint8, shape=(1,)),
            Tensor(name="predictions", dtype=object, shape=(1,)),
            Tensor(name="drift", dtype=np.uint, shape=(1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    logger.info("Serving models")
    triton.serve()

Log error: {"error":"Unable to parse 'inputs': attempt to access non-existing object member 'inputs'"}
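
The error message says the HTTP body lacks a top-level "inputs" member, i.e. the raw {id, rows, columns} document is not in the KServe v2 format Triton expects. A hedged sketch of wrapping the fields that way, mirroring the requests-based client shown further down this page; the dtypes, shapes, and the idea of serializing rows/columns as JSON strings are illustrative assumptions only, must match the bind() declaration above, and would require the callable to decode and json.loads those strings:

import json
import requests

raw = {"id": "123", "rows": [[0.1] * 16], "columns": [f"feature{i}" for i in range(1, 17)]}

payload = {
    "id": "0",
    "inputs": [
        # Shapes are [batch, 1] because each input was declared with shape=(1,);
        # the id value is sent as an integer to match the declared UINT8 dtype.
        {"name": "id", "shape": [1, 1], "datatype": "UINT8", "data": [123]},
        {"name": "rows", "shape": [1, 1], "datatype": "BYTES", "data": [json.dumps(raw["rows"])]},
        {"name": "columns", "shape": [1, 1], "datatype": "BYTES", "data": [json.dumps(raw["columns"])]},
    ],
}
response = requests.post("http://localhost:8000/v2/models/prob1/infer", json=payload).json()
print(response)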

Are there any examples of using python multiprocessing to run multiple copies of model on same GPU

Hi, thanks for the library, this is much easier than regular triton.

Right now I'm trying to run multiple instances of the same model (concurrently) on one big GPU. Unfortunately the concurrency is not working, seems like a GIL issue. Seems I have to use python multiprocessing library to do it myself. Do you have any examples of doing this? I looked through the existing examples but didn't find anything.

Failed to start the server

Hi, I am trying the example in the README, but the server fails to start with the following output:

(cellpose) weiouyang@europa:~/workspace/pytriton$ python test-pytriton.py 
Triton Inference Server exited with failure. Please wait.
Internal proxy backend error. It will be closed.
Context was terminated
Traceback (most recent call last):
  File "/home/weiouyang/workspace/pytriton/pytriton/models/model.py", line 222, in _model_proxy_handshake
    socket.recv()
  File "zmq/backend/cython/socket.pyx", line 803, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 839, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 193, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 188, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ContextTerminated: Context was terminated
Traceback (most recent call last):
  File "test-pytriton.py", line 34, in <module>
    triton.serve()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 343, in serve
    self.run()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 318, in run
    self._wait_for_models()
  File "/home/weiouyang/workspace/pytriton/pytriton/triton.py", line 429, in _wait_for_models
    client.wait_for_model(timeout_s=120)
  File "/home/weiouyang/workspace/pytriton/pytriton/client/client.py", line 154, in wait_for_model
    wait_for_model_ready(self._client, self._model_name, self._model_version, timeout_s=timeout_s)
  File "/home/weiouyang/workspace/pytriton/pytriton/client/utils.py", line 224, in wait_for_model_ready
    is_model_ready = condition.wait_for(_is_model_ready, timeout=min(1.0, timeout_s))
  File "/home/weiouyang/miniconda3/envs/cellpose/lib/python3.8/threading.py", line 328, in wait_for
    result = predicate()
  File "/home/weiouyang/workspace/pytriton/pytriton/client/utils.py", line 207, in _is_model_ready
    raise PytritonClientModelUnavailableError(f"Model {model_name}/{model_version_msg} is unavailable.")
pytriton.client.exceptions.PytritonClientModelUnavailableError: Model Linear/1 is unavailable.

I installed pytriton by building from source (on the main branch, commit hash: bc71f5840e909f58beaba5ecd3c4ea624e242523).

Also a minor suggestion to improve the README is to add from pytriton.decorators import batch when using the batch decorator.

pytriton is slower than triton

My model is a BERT classification model. Triton inference is 17 ms per request; pytriton is 34 ms per request. Could someone help me?

Below is the pytriton server code:

import os

import numpy as np
import torch
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


label_list = ["finance", "realty", "stocks", "education", "science", "society", "politics", "sports", "game",
                  "entertainment", ]

pretrained_bert_dir = "/models"
tokenizer = BertTokenizer.from_pretrained(pretrained_bert_dir)
bert_config = BertConfig.from_pretrained(pretrained_bert_dir, num_labels=len(label_list))

model = BertForSequenceClassification(bert_config)
model.load_state_dict(torch.load(os.path.join(pretrained_bert_dir, "bert_model.pth")))
model.to("cuda")
model.eval()


@batch
def _infer_fn(sentence: np.ndarray):
    print(sentence)
    sequences_batch = np.char.decode(sentence.astype("bytes"), "utf-8")
    print(sequences_batch)

    labels = []
    for s in sequences_batch:
        inputs = tokenizer(s[0], max_length=32, truncation="longest_first", return_tensors="pt")
        inputs = inputs.to("cuda")

        outputs = model(**inputs)
        print(outputs)

        logits = outputs[0]
        print(logits)
        label_pro = torch.max(logits.data, 1)[1].tolist()
        print(label_pro)
        label = label_list[label_pro[0]]
        print(label)
        labels.append(label)
    return {"label": np.char.encode(labels, "utf-8")}

with Triton() as triton:
    triton.bind(
        model_name="BERT",
        infer_func=_infer_fn,
        inputs=[
            Tensor(name="sentence", dtype=np.bytes_, shape=(1,)),
        ],
        outputs=[
            Tensor(name="label", dtype=np.bytes_, shape=(1,)),
        ],
        config=ModelConfig(max_batch_size=10)
    )
    triton.serve()

Below is the pytriton client code:

import requests

sequence = ["明天要考试了"]
d = {
  "id": "0",
  "inputs": [
    {
      "name": "sentence",
      "shape": [1,1],
      "datatype": "BYTES",
      "data": sequence
    }
  ]
}
res = requests.post(url="http://114.116.17.243:31800/v2/models/BERT/infer",json=d).json()
r = res["outputs"][0]["data"][0]
print(r)
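
One hedged observation (an assumption, not a measured diagnosis): the server callable above tokenizes and runs the model one sentence at a time, so Triton's dynamic batching cannot amortize anything. A sketch of handling the whole batch in one tokenizer/model call, reusing the tokenizer, model, and label_list defined in the server code above:

import numpy as np
import torch
from pytriton.decorators import batch


@batch
def _infer_fn(sentence: np.ndarray):
    # Decode the whole batch of byte strings at once.
    sequences = [s[0] for s in np.char.decode(sentence.astype("bytes"), "utf-8")]

    # Tokenize and run the model on the full batch instead of looping per sentence.
    inputs = tokenizer(sequences, max_length=32, truncation=True, padding=True, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits

    labels = [label_list[i] for i in logits.argmax(dim=-1).tolist()]
    return {"label": np.char.encode(np.array(labels), "utf-8")}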

Enabling Redis cache throws: Unable to find shared library libtritonserver.so

Description

Running the docker image (below) with redis cache settings enabled via cache_config causes the triton server to fail on launch.

When I comment out both the cache_config and cache_directory options, the server starts and runs successfully but does not utilize the redis cache.

When I uncomment the cache_config and cache_directory options, the server fails to find the libtritonserver.so shared library file.

To reproduce

FROM nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3

WORKDIR /app

COPY . .

RUN pip install poetry

# Build https://github.com/triton-inference-server/redis_cache

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get install -y python3 python3-distutils python-is-python3 git \
    build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev curl \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
    openssh-client cmake rapidjson-dev


RUN git clone https://github.com/triton-inference-server/redis_cache.git && \
    cd redis_cache && \
    ./build.sh && \
    mkdir -p /opt/tritonserver/caches/redis && \
    cp /app/redis_cache/build/libtritoncache_redis.so /opt/tritonserver/caches/redis/libtritoncache_redis.so

RUN cd triton-server && \
    poetry config virtualenvs.create false && \
    poetry install 


# HTTP, gRPC, & metrics traffic respectively
EXPOSE 8000 \
       8001 \
       8002

Fails with unable to find shared library libtritonserver.so

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ....
    return model, tokenizer


@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]



def main(model_hidden_dim: int = 768):
    
    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()


if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)

Does not fail, but does not utilize Redis

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

import os
import logging
import torch
import numpy as np
from dotenv import load_dotenv
load_dotenv()

logger = logging.getLogger("triton-server.triton_server.main")

REDIS_HOST: str = os.getenv('REDIS_HOST', 'redis')
REDIS_PORT: str = os.getenv('REDIS_PORT', '6379')
MAX_BATCH_SIZE: int = int(os.getenv('MAX_BATCH_SIZE', '8'))
VERBOSE: bool = bool(os.getenv('VERBOSE', 'True'))
TRITON_SERVER_HOST: str = os.getenv('TRITON_SERVER_HOST', '0.0.0.0')
TRITON_SERVER_PORT: str = os.getenv('TRITON_SERVER_PORT', '8000')

triton_config = TritonConfig(
    # cache_config=[f"redis,host={REDIS_HOST}", f"redis,port={REDIS_PORT}"],
    http_address=TRITON_SERVER_HOST,
    http_port=TRITON_SERVER_PORT,
    # cache_directory="/opt/tritonserver/caches",
)

def load_model() -> tuple:
    ....
    return model, tokenizer


@batch
def inference(sequence_of_text: np.ndarray):
    ...
    return [last_hidden_states]



def main(model_hidden_dim: int = 768):
    
    log_level = logging.DEBUG if VERBOSE else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    with Triton(config=triton_config) as triton:
        logger.info("Loading model...")
        triton.bind(
            model_name='model',
            infer_func=inference,
            inputs=[
                Tensor(name="sequence_of_text", dtype=bytes, shape=(1, )),
            ],
            outputs=[
                Tensor(
                    name="last_hidden_state", dtype=np.float32, shape=(-1, model_hidden_dim)
                ),
            ],
            config=ModelConfig(max_batch_size=MAX_BATCH_SIZE, response_cache=True),
            strict=True,
        )
        logger.info("Serving inference")
        triton.serve()


if __name__ == "__main__":
    model, tokenizer = load_model()
    main(model.config.dim)

Observed results and expected behavior

Please describe the observed results as well as the expected results.
If possible, attach relevant log output to help analyze your problem.
If an error is raised, please paste the full traceback of the exception.


Environment

  • OS/container version:
    • container: nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3
    • OS: macOS Sonoma 14.1.1 (23B81)
    • docker build --platform=linux/arm64 -t test-image .
  • Python interpreter distribution and version: python==3.10.12
  • pip version: 23.3
  • PyTriton version: 0.1.4
  • Deployment details: CPU only docker image

Additional context
Add any other context about the problem here.

unable to install

Description

A clear and concise description of the bug.

To reproduce

If relevant, add a minimal example so that we can reproduce the error, if necessary, by running the code. For example:

# server
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def _infer_fn(**inputs):
    ...
    results_dict = model(**inputs)  # ex note: the bug is here, we expect to receive ...
    ...
    # note: observing results_dict as dictionary of numpy arrays
    return results_dict


with Triton() as triton:
    triton.bind(
        model_name="MyModel",
        infer_func=_infer_fn,
        inputs=[
            Tensor(name="in1", dtype=np.float32, shape=(-1,)),
            Tensor(name="in2", dtype=np.float32, shape=(-1,)),
        ],
        outputs=[
            Tensor(name="out1", dtype=np.float32, shape=(-1,)),
            Tensor(name="out2", dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    triton.serve()
# client
import numpy as np
from pytriton.client import ModelClient

batch_size = 2
in1_batch = np.ones((batch_size, 1), dtype=np.float32)
in2_batch = np.ones((batch_size, 1), dtype=np.float32)

with ModelClient("localhost", "MyModel") as client:
    result_batch = client.infer_batch(in1_batch, in2_batch)

Observed results and expected behavior

Please describe the observed results as well as the expected results.
If possible, attach relevant log output to help analyze your problem.
If an error is raised, please paste the full traceback of the exception.


Environment

  • OS/container version: [e.g., container nvcr.io/nvidia/pytorch:23.02-py3 / virtual machine with Ubuntu 22.04]
  • Python interpreter distribution and version: [e.g., CPython 3.8 / conda 4.7.12 with Python 3.8 environment]
  • PyTriton version: [e.g., 0.1.4 / custom build from source at commit ______]
  • Deployment details: [e.g., multi-node multi-GPU setup on GKE / multi-GPU single-node setup in Jupyter Notebook]

Additional context
Add any other context about the problem here.
ERROR: Could not find a version that satisfies the requirement nvidia-pytriton

Error deploying model on Vertex AI

Description

Hi! I'm trying to deploy a StableDiffusion model in GCP Vertex AI using Pytriton backend. My code works on a local machine, and I've been able to send requests and receive inference responses.

My problem arises when I try to create an endpoint using Vertex AI. The server run fails with this error:

WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait.

And then:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
...
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")

I don't know whether the Vertex AI service error is due to the server crashing first, or vice versa.

To reproduce

Attaching my server code

# server
import argparse
import logging
import os
import time

import numpy as np
import torch
from model import ModelWrapper
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig
from urllib.parse import urlparse

class _InferFuncWrapper:
    """
    Class wrapper of inference func for triton. Used to also store the model variable
    """

    def __init__(self, model: torch.nn.Module):
        self._model = model

    @batch
    def __call__(self, **inputs) -> np.ndarray:
        """
        Main inference function for triton backend. Called after batch inference.
        Performs all the logic of decoding inputs, calling the model and returning
        outputs.

        Args:
            prompts: Batch of strings with the user prompts
            init_images: Batch of initial image to run the diffusion

        Returns
            image: Batch of generated images
        """
        (prompts, init_images) = inputs.values()
        # decode prompts and images
        prompts = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in prompts]
        init_images = [
            np.char.decode(enc_img.astype("bytes"), "utf-8").item()
            for enc_img in init_images
        ]
        init_images = [_decode_img(enc_img) for enc_img in init_images]
        # transform image arrays to tensors and adjust dims to torch usage
        images_tensors = torch.tensor(init_images, dtype=torch.float32).permute(
            0, 3, 1, 2
        )
        LOGGER.debug(f"Prompts: {prompts}")
        LOGGER.debug(f"{len(init_images)} images size: {init_images[0].shape}")
        LOGGER.info("Generating images...")
        # call diffusion model
        outputs = self._model.run(prompts, images_tensors)
        LOGGER.debug(f"Prepared batch response of size: {len(outputs)}")
        return {"image": np.array(outputs)}


def _parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Enable verbose logging in debug mode.",
        default=True,
    )

    parser.add_argument(
        "--vertex",
        "-s",
        action="store_true",
        help="Enable copying model files from storage for vertex deployment",
        default=False,
    )

    return parser.parse_args()

def main():
    """Initialize server with model."""
    args = _parse_args()

    # initialize logging
    log_level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    if args.vertex:
        LOGGER.debug("Vertex: Loading pipeline from Vertex Storage")
        storage_path = os.environ["AIP_STORAGE_URI"]
    else:
        LOGGER.debug("Loading pipeline locally")
        storage_path = ("") # Path to local files
    
    bucket_name, subdirectory = parse_path(storage_path)
    LOGGER.debug(f"Downloading files... Started at: {time.strftime('%X')}")
    download_blob(bucket_name, subdirectory)
    LOGGER.debug(f"Files downloaded! Finished at: {time.strftime('%X')}")
    folder_path = os.path.join("src", subdirectory)

    LOGGER.debug(f"Running on device: {DEVICE}, dtype: {DTYPE}, triton_port:{PORT}")
    LOGGER.info("Loading pipeline...")
    model = ModelWrapper(logger=LOGGER, folder_path=folder_path)
    LOGGER.info("Pipeline loaded!")

    log_verbose = 1 if args.verbose else 0

    config = TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose)

    with Triton(config=config) as triton:
        # bind the model with its inference call and configuration
        triton.bind(
            model_name="StableDiffusion_Img2Img",
            infer_func=_InferFuncWrapper(model=model),
            inputs=[
                Tensor(name="prompt", dtype=np.bytes_, shape=(1,)),
                Tensor(name="init_image", dtype=np.bytes_, shape=(1,)),
            ],
            outputs=[
                Tensor(name="image", dtype=np.bytes_, shape=(1,)),
            ],
            config=ModelConfig(
                max_batch_size=4,
                batcher=DynamicBatcher(
                    max_queue_delay_microseconds=100,
                ),
            ),
            strict=True,
        )
        # serve the model for inference
        triton.serve()


if __name__ == "__main__":
    main()

When creating Vertex endpoint, server predict route is configured to:
/v2/models/StableDiffusion_Img2Img/infer

And server health route is configured to:
/v2/health/live

With Vertex port=8015, same as HTTP port set in model configuration.

Observed results and expected behavior

As stated, the server runs on a local machine but fails when initializing the endpoint in Vertex AI.
During the Vertex build, local files are correctly downloaded and the model pipeline is loaded, so the error is probably in the triton.bind() function.
Attaching complete log output:

DEBUG - StableDiffusion_Img2Img.server: Files downloaded! Finished at: 18:28:01
DEBUG - StableDiffusion_Img2Img.server: Running on device: cuda, dtype: torch.float16, triton_port:8015
INFO - StableDiffusion_Img2Img.server: Loading pipeline..
INFO - StableDiffusion_Img2Img.server: Pipeline loaded!

...
2023-11-23 18:29:10,322 - DEBUG - pytriton.triton: Triton Inference Server binaries ready in /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver
2023-11-23 18:29:10,322 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
 2023-11-23 18:29:10,323 - DEBUG - pytriton.utils.distribution: Obtained nvidia_pytriton.libs path: /usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.triton: Starting Triton Inference
2023-11-23 18:29:10,324 - DEBUG - pytriton.server.triton_server: Triton Server binary /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver/bin/tritonserver. Environment:
{
...
}
2023-11-23 18:29:10,449 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99996042251587)
2023-11-23 18:29:12,954 - WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait
2023-11-23 18:29:12,954 - DEBUG - pytriton.server.triton_server: Triton Inference Server exit code 1
2023-11-23 18:29:12,954 - DEBUG - pytriton.triton: Got callback that tritonserver process finished
2023-11-23 15:31:10.655 Traceback (most recent call last):
2023-11-23 15:31:10.655 File "/home/app/src/server.py", line 200, in <module>
2023-11-23 18:31:10,655 - DEBUG - pytriton.triton: Cleaning model manager, tensor store and workspace.
failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
2023-11-23 18:31:10,655 - DEBUG - pytriton.utils.workspace: Cleaning workspace dir /root/.cache/pytriton/workspace_y7vpgv3x
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")
pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

Additional steps taken

From the timeout error raised by pytriton, we've tried increasing the timeout by setting monitoring_period_s in server.run() to an arbitrarily high threshold.

We've also tried adapting the server configuration to Vertex with:

TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose, allow_vertex_ai=True, vertex_ai_port=8080)

But we get the same error.
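
One more hedged avenue, based on the error text above: the tritonserver binary has a --vertex-ai-default-model flag for exactly this "single model / default model" complaint. Assuming TritonConfig mirrors it as vertex_ai_default_model (an assumption to verify against the TritonConfig reference), the configuration might look like:

from pytriton.triton import TritonConfig

# 'vertex_ai_default_model' is assumed to mirror tritonserver's
# --vertex-ai-default-model flag; the other options come from the attempts above.
config = TritonConfig(
    http_port=8015,
    exit_on_error=True,
    allow_vertex_ai=True,
    vertex_ai_port=8080,
    vertex_ai_default_model="StableDiffusion_Img2Img",
)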

Environment

Docker base image: nvcr.io/nvidia/pytorch:23.10-py3
Requirements:

torch @ https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
diffusers==0.7.2
transformers==4.21.3
ftfy==6.1.1
importlib-metadata==4.13.0
nvidia-pytriton==0.4.1
Pillow==9.5
google-cloud-storage==2.10.0

Any help is appreciated!!

Dynamic Batching Seems Invalid?

Hi

I am trying to combine FastAPI with pytriton to serve a MobileNet_V2 TensorRT plan model. The inference service has dynamic batching enabled and response caching disabled. I used Python's locust tool for load testing, manually printed the tensor shapes, and checked the metrics reported by the Triton server itself. It seems that dynamic batching does not work. My example code is as follows; is there any problem with it?

Code Snippet

# input shape: nx3x112x112, output shape: nx1000
import numpy as np
import tensorrt as trt
import torch
from fastapi import FastAPI, Request
from pytriton.client import ModelClient
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton

with open("/mnt/data/codes/learning/python/pytriton/mobile_v2.plan", "rb") as f:
    runtime = trt.Runtime(trt.Logger(trt.Logger.ERROR))
    engine  = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context() 


@batch
def infer_trt(**inputs: np.ndarray):
    (input1_batch,)           = inputs.values()
    input1_batch_tensor  = torch.from_numpy(input1_batch).to("cuda")

    print("input1_batch_tensor.shape = ", input1_batch_tensor.shape)

    context.set_binding_shape(0, input1_batch_tensor.shape)
    shape                   = context.get_binding_shape(1)
    output1_batch_tensor    = torch.empty(size=tuple(shape), dtype=torch.float32, device="cuda")

    bindings                = [None] * 2
    bindings[0]             = input1_batch_tensor.contiguous().data_ptr()
    bindings[1]             = output1_batch_tensor.contiguous().data_ptr()
    
    context.execute_async_v2(bindings, torch.cuda.current_stream().cuda_stream)
    output1_batch        = output1_batch_tensor.cpu().detach().numpy()
    return [output1_batch]

class ModelTrt(object):
    def __init__(self) -> None:
        self._run = False
        self.triton = Triton(config = triton_config)
        self.triton.bind(
            model_name = "MobileNet_v2",                  
            infer_func = infer_trt,
            inputs=[
                Tensor(dtype=np.float32, shape=(3, 112, 112)),    
            ],
            outputs=[
                Tensor(dtype=np.float32, shape=(1000, )),
            ],
            config = ModelConfig(batching = True,
                        max_batch_size = 16, 
                        batcher = DynamicBatcher(
                            max_queue_delay_microseconds = 50000,
                            preferred_batch_size = [8, 16],
                            preserve_ordering = True
                        ),
                        response_cache = False)
        )

     
    def run(self):
        if not self._run:
            self.triton.run()     
            self._run = True
        logger.info("Serving models")



    def __del__(self):
        if self._run:
            self.triton.stop()
        logger.info("Serving models stop")

server = ModelTrt()

server.run()

app   = FastAPI()

@app.post("/infer")
async def infer_image(request : Request):
    input1_batch = torch.randn(2, 3, 112, 112).cpu().detach().numpy() 

    with ModelClient("localhost", "MobileNet_v2") as client:
        logger.info("Sending request")
        result_dict = client.infer_batch(input1_batch)

    for output_name, output_batch in result_dict.items():
        logger.info(f"{output_name}: {output_batch[0].shape}")
    return {output_name: output_batch[0].shape}

TensorRT Plan Model

trtexec --onnx=mobile_v2.onnx --saveEngine=mobile_v2.plan  \
        --minShapes=input:1x3x112x112  \
        --optShapes=input:1x3x112x112 \
        --maxShapes=input:16x3x112x112 

Pressure Test

# locust_perf.py
from locust import task, FastHttpUser

class WebService(FastHttpUser):
    host = "http://localhost:31010"

    @task
    def root(self):
        with self.client.request(method = "POST", url="/infer") as resp:
            assert  resp.status_code == 200
locust -f locust_perf.py --headless -u 10 -r 1

Shape Info in Callable

[screenshot: input tensor shapes printed in the callable]

Metrics Info from Triton Server

curl localhost:8002/metrics > metrics.txt

nv_inference_request_success{model="MobileNet_v2",version="1"} 1078
nv_inference_request_failure{model="MobileNet_v2",version="1"} 0
nv_inference_count{model="MobileNet_v2",version="1"} 2156
nv_inference_exec_count{model="MobileNet_v2",version="1"} 1078

Libs Version Info

nvidia-pytriton         0.2.0
tritonclient            2.34.0

How to infer with sequences?

It seems that ModelClient does not support sequence inference: "sequence_start", "sequence_id", and "sequence_end" cannot be found in infer_sample/infer_batch.

Streaming and batching

Hi,
I'm new to pytriton.
I am trying to deploy a model and run inference with a text-generation pipeline, and I managed to get streaming to work as per the example.
I would like to know how I can scale the deployment and whether batching is compatible with streaming.

Thank you in advance

While running inference via server.py and client.py, why does the client take GPU memory?

Hello, I am new to Triton and trying to understand its behaviour. I am facing one point of confusion, which is given below:

Here I am running two client requests against one server.py. Why do the two client.py processes consume GPU memory (showing up as two GPU processes) when the model is running in the server?


Here 967 MiB is consumed by the server.py script and 105 MiB by the client.py files.

When multiple client requests come in, does Triton create multiple instances of the same model, or does it run them on a single instance?

Best practices with ModelClient

This is really just a question around using ModelClient in a situation where you expect to send a large number of inference requests over time.

My use-case is a python service that listens to a RabbitMQ queue for inference requests and then sends a Triton inference call for each message in the queue. Therefore, the service that is using ModelClient will be running continuously. My question is whether I should use the context manager with ModelClient() as client... for each inference request or if I should instantiate one client object at start-up and keep it open until the service is shut down. I didn't see any examples that didn't use the context manager so I wasn't sure.
Thank you
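
Not an official recommendation, but a minimal sketch of the long-lived variant described in the question, using only the ModelClient calls shown elsewhere on this page; the model name is a placeholder and the message source stands in for the RabbitMQ loop:

import numpy as np
from pytriton.client import ModelClient


def consume_messages():
    # Placeholder for the RabbitMQ consume loop from the question.
    yield np.array([1.0, 2.0], dtype=np.float32)


client = ModelClient("localhost", "Linear")  # created once at service start-up
try:
    for data in consume_messages():
        result = client.infer_sample(data=data)
        print(result)
finally:
    client.close()  # released once when the service shuts down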

AttributeError: '_thread.RLock' object has no attribute '_recursion_count'

Description

Bug when upgrading to Python 3.11.7.
3.11.5 is fine.

Log

  File "/tmp/folderZ0Klgm/1/model.py", line 210, in execute
    meta_requests = self._put_requests_to_buffer(triton_requests)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/folderZ0Klgm/1/model.py", line 305, in _put_requests_to_buffer
    tensor_ids = self._tensor_store.put([tensor for *_, tensor in input_arrays_with_coords])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/folderZ0Klgm/1/communication.py", line 688, in put
    shm = multiprocessing.shared_memory.SharedMemory(block.shm_name, create=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/shared_memory.py", line 120, in __init__
    resource_tracker.register(self._name, "shared_memory")
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 174, in register
    self._send('REGISTER', name, rtype)
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 182, in _send
    self.ensure_running()
  File "/miniconda3/envs/py3.11/lib/python3.11/multiprocessing/resource_tracker.py", line 100, in ensure_running
    if self._lock._recursion_count() > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_thread.RLock' object has no attribute '_recursion_count'

Environment

  • OS/container version: Ubuntu 22.04 (latest apt update)
  • Python interpreter distribution and version: [Python 3.11.7 from miniconda]
  • pip version: 23.3.1
  • PyTriton version: 0.4.2

Example of (or support for) Inference Callable of Triton ensemble definition

Is your feature request related to a problem? Please describe.
I currently have an automated config generator for creating proper ensemble pbtxt configs over a DAG of backends. I currently do not see a natural way of defining and/or referencing Triton ensembles with the PyTriton module.

Describe the solution you'd like
I would like to either:
(a) create the ensemble definition directly with the PyTriton module, or
(b) load all ensemble models and the associated (already generated via above method) ensemble and model configs

Describe alternatives you've considered
Our current solution mentioned in the feature request does the job, but does not utilize PyTriton's friendly interface.

Additional context
N/A

May input tensor of the Callable be GPU addressed Tensor?

Hi

As shown in the title, can GPU tensors (such as a torch.Tensor on device) be inputs to the Callable? For a complex application with multiple backends connected in series, a numpy.ndarray-only scheme results in multiple H2D and D2H operations and wastes a lot of time.

TensorRT-LLM support?

Is your feature request related to a problem? Please describe.

I can't seem to find any examples of how to deploy models that are built for TensorRT-LLM. Is this possible, and am I missing the documentation for it?

Describe the solution you'd like

Either improve documentation on how to utilize pytriton with tensorrt-llm or explain why such a combination is non-desirable or ill-formed

Describe alternatives you've considered

I've looked at the OPT-Jax example and have begun experimenting with using a Jax port of LLaMA 2 with that example.

Load python backend models when pytriton starts

Is your feature request related to a problem? Please describe.
We would like to load an existing Triton model repository with a few python_backend models when we start pytriton. We would like to do this in Colab.

Describe alternatives you've considered
We could potentially run Docker to spin up a Triton server, but this is in Colab, where we cannot run Docker.

@ctr26

how to use instance group

Hi

I have a question about instance_group.

In TIS, I'm using instance_group in config.pbtxt.

instance_group [{
count: 3,
kind: KIND_GPU
}]

I want to use an instance_group count of 3 on a single GPU in pytriton.

Is there a way to use instance_group as above in pytriton?
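
There is no config.pbtxt to edit, but a hedged sketch follows, assuming infer_func accepts a sequence of callables with each callable served as one model instance (an assumption to verify against the binding documentation); model and tensor names are placeholders:

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    return [data * np.array([[-1]], dtype=np.float32)]


with Triton() as triton:
    triton.bind(
        model_name="Linear",
        # Assumption: a sequence of callables maps to multiple model instances,
        # roughly like instance_group { count: 3 } in config.pbtxt.
        infer_func=[infer_fn, infer_fn, infer_fn],
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()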

some questions

1. What is the difference between infer_sample and infer_batch in the source code? I don't see a difference between them: infer_sample will also try to batch the input if max_batch_size > 1, which should be done by infer_batch.
2. In server.py, how should we return the results? The docs don't elaborate on this.
For example, we have a result as below:

imgs=np.ones((8,1920,1080,3),dtype=np.uint8)

Which way should I use to return this value: return [imgs] or return dict(result=imgs)? (See the sketch at the end of this issue.)
I define the triton_config as follows:

with Triton(config=triton_config) as triton:
        logger.info("Loading ResNet model.")
        triton.bind(
            model_name="ResNet",
            infer_func=resnet_factory(DEVICES),
            inputs=[
                Tensor(name="image", dtype=np.uint8, shape=(-1, -1, 3)),
            ],
            outputs=[
                Tensor(name="result", dtype=np.uint8, shape=(-1,)),
            ],
            config=ModelConfig(
                max_batch_size=32,
                batcher=DynamicBatcher(max_queue_delay_microseconds=5000),  # 5ms
            ),
        )
        logger.info("Serving inference")
        triton.serve()

Another question:
When result key's shape is defined as

shape=(-1,)
shape=(-1,-1,-1,3)
shape=(-1,-1,-1,-1,3)

the server can always successfully return the results, which is very confusing. In the above situation, shouldn't the result key's shape only be (-1,-1,-1,3)?
3. I noticed you added a CLI option inference_timeout_s, which raises a timeout exception on the client side, but the server side keeps processing and can't receive new requests. I recommend a more proper way that raises a timeout exception in a thread on the server side, so that the server stops processing immediately and can accept new requests; please refer to https://github.com/dr-luke/PyTimeoutAfter
4. What is the speed of sending requests from the client to the server? I decode the images on the client and then send them to the server; I find this quite slow. The transport between client and server is really time consuming.

Sorry for so many questions. Thanks very much!
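
On question 2, both return conventions already appear on this page: the Quick Start callable returns a list ordered like the declared outputs, while the BERT example returns a dict keyed by output name. A minimal sketch of the two side by side (hedged, not an authoritative statement of the API contract; the shapes are placeholders):

import numpy as np
from pytriton.decorators import batch


@batch
def infer_as_list(image):
    imgs = np.zeros((image.shape[0], 8), dtype=np.uint8)
    return [imgs]  # positional: order must match the declared outputs=[...]


@batch
def infer_as_dict(image):
    imgs = np.zeros((image.shape[0], 8), dtype=np.uint8)
    return {"result": imgs}  # named: keys must match Tensor(name=...) in outputs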

Support for ubuntu20.04

Is your feature request related to a problem? Please describe.
Right now pytriton must be used on Ubuntu 22.04, which is really strict since most people are using a version < 22.04. I know this is because Triton Inference Server is built on 22.04. I wonder if an LTS version sticking to 20.04 could be published.

Describe the solution you'd like
An OS-compatible version of pytriton. If people need new features from Triton Inference Server, they can use pytriton built on 22.04; otherwise they can use pytriton built on 20.04, which includes bug fixes and features that are specific to pytriton rather than Triton Inference Server.

Describe alternatives you've considered
If we cannot do this officially, can I build pytriton from source?

[problem] How to allow multiple models to run on the same GPU at the same time?

I'm using pytriton in my project, but I'm facing some problems.

My core server-side code is as follows:

import numpy as np
import torch
from torch.cuda import Event, synchronize
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton

start = Event(enable_timing=True)
end = Event(enable_timing=True)

@batch
def Densenet169(tensor,num = -1):
    # logging.info(f'num:{num}')
    tensor = torch.from_numpy(tensor).cuda(cuda_num)
    start.record()
    result = model2(tensor,num)
    end.record()
    synchronize()
    time = start.elapsed_time(end)
    return {'result' : result.cpu().detach().numpy(),
            'time': np.full([1,1],time,dtype=np.float32)}

batcher = DynamicBatcher(preferred_batch_size=[32])

with Triton() as triton:
    triton.bind(
        model_name="Densenet169",
        infer_func=Densenet169,
        inputs=[
            Tensor(name = 'tensor',dtype=np.float32, shape=(-1,-1,-1,)),
            Tensor(name = 'num',dtype=np.int32, shape=(1,)),
        ],
        outputs=[
            Tensor(name = 'result',dtype=np.float32, shape=(-1,)),
            Tensor(name = 'time',dtype=np.float32, shape=(1,))
        ],
        config=ModelConfig(batching=True,max_batch_size=128,batcher=batcher),
        # config=ModelConfig(batching=False),
        strict=False,
    )
    triton.serve()

As you can see, there are two inputs and two outputs in my server settings: one output, 'result', is the result of the input 'tensor', and the other is the CUDA running time during inference.

The 'result' is not that important, but 'time' is what I need. Triton does have metrics, but only a rough number, not as precise as I need. I want the time each inference costs (or several of them, but in fact Triton's metrics are enough for me).

After I modified my infer_func, things went unexpectedly:

  1. I need to maximize the GPU utilization rate, so I tried to increase the batch size by sending requests more frequently from more clients, but pytriton merges HTTP requests into one request, which measures the running time of more than one request and returns it as a single running time to the merged requests. This is not what I expected.
  2. So I tried to send more than one sample in one batch, which is prohibited by pytriton, since an input can only have the shape of a single sample. I have not tried setting 'strict' to False in this case, but I doubt it would work, so I haven't tried it yet.
  3. Then I found another package called tritonclient, with which I successfully pass more than one sample in one batch using the following code:
        request = tritonhttpclient.InferInput('tensor', y.shape, 'FP32')
        request_num = tritonhttpclient.InferInput('num', [batchsize, 1], 'INT32')
        request.set_data_from_numpy(y)
        request_num.set_data_from_numpy(num)

        response = triton_client.infer(model_name, inputs=[request, request_num], model_version='1')
  4. Overall, I did increase the GPU utilization rate. However, a few problems still bother me:

    a) How can I turn off the automatic merging of HTTP requests without setting 'batching' to False? As I mentioned in 2, pytriton merges HTTP requests into one request; is there any way to turn that off? I tried to use 'DynamicBatcher' as in the code above, and while it works, it still feels awkward and unnatural. Could there be a more elegant way?

    b) How can I allow multiple models to run on the same GPU at the same time? The GPU utilization rate is not high enough; I want requests to run immediately when the server receives them, so that multiple models can run on the same GPU at the same time. What should I do? (See the sketch after this list.)

    c) Extra problem: is there any method to achieve my purpose in an elegant way? I want to maximize the GPU utilization rate as much as possible so that the models influence each other, for scientific purposes.
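
On (b), a hedged sketch (not an official answer): several models can be bound to the same Triton instance, so requests to different models can be in flight on the same GPU. The model names, callables, and shapes below are placeholders:

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def model_a(tensor):
    return {"result": tensor.astype(np.float32)}


@batch
def model_b(tensor):
    return {"result": (tensor * 2).astype(np.float32)}


with Triton() as triton:
    # Two independent models bound to one Triton instance; both are served
    # from the same process and share the same GPU.
    for name, fn in (("ModelA", model_a), ("ModelB", model_b)):
        triton.bind(
            model_name=name,
            infer_func=fn,
            inputs=[Tensor(name="tensor", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=16),
        )
    triton.serve()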

Binary output truncated...

Description

When a large array is converted to npz format and returned, the client receives a truncated result.

To reproduce

If relevant, add a minimal example so that we can reproduce the error, if necessary, by running the code. For example:

# server
import io
import h5py
import numpy as np

from typing import Dict

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


class TritonInferModel:

    def __init__(self) -> None:
        self._model_name = 'TestImpl'

        # Encoder
        self._enc_ip = (Tensor(name="input_string", shape=(-1,), dtype=bytes),
                        Tensor(name="format", shape=(1, ), dtype=bytes),)

        self._enc_op = (Tensor(name="embedding", shape=(1, ), dtype=bytes),)


    def _to_npz(self, **kwargs):
        with io.BytesIO() as output:
            np.savez(output, **kwargs)
            return output.getvalue()

    def _to_h5(self, **kwargs):
        with io.BytesIO() as output:
            with h5py.File(output, 'w') as h5f:
                for label, data in kwargs.items():
                    h5f.create_dataset(label, data=data)
            return output.getvalue()

    @batch
    def encoder(self, **inputs: np.ndarray) -> Dict[str, np.ndarray]:
        embeddings={'embeddings': np.random.rand(1, 1, 512)}

        format = np.char.decode(inputs.pop("format").astype("bytes"), encoding="utf-8")
        format = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in format][0]
        if format == 'npz':
            output = self._to_npz(**embeddings)
        elif format == 'h5':
            output = self._to_h5(**embeddings)
        else:
            raise Exception(f'Unknown format {format}')

        print("=================> size", len(output), np.array([[output]], dtype=bytes).shape)

        return {"embedding": np.array([[output]], dtype=bytes), }

    def start(self, triton):
        print(f"Loading {self._model_name} encoder...")

        triton.bind(
            model_name=f"embeddings",
            infer_func=self.encoder,
            inputs=self._enc_ip,
            outputs=self._enc_op,
            config=ModelConfig(max_batch_size=8),
            strict=True,
        )

def main():
    with Triton() as triton:
        inferer = TritonInferModel()
        inferer.start(triton)
        triton.serve()

if __name__ == "__main__":
    main()
# client
import io
import h5py
import numpy as np
import tritonclient.http as httpclient

from tritonclient.utils import *


encoder_model = "embeddings"

test_input = ['Input_1', 'Input_2', 'Input_3']

def test_encoder_n_decoder():
    with httpclient.InferenceServerClient("localhost:8000") as client:

        format = 'npz'
        smis_input = np.array([test_input]).astype(bytes)
        format_input = np.array([[format]]).astype(bytes)
        inputs = [
            httpclient.InferInput("input_string", smis_input.shape,
                                np_to_triton_dtype(smis_input.dtype)),
            httpclient.InferInput("format", format_input.shape,
                                np_to_triton_dtype(format_input.dtype)),
        ]

        inputs[0].set_data_from_numpy(smis_input)
        inputs[1].set_data_from_numpy(format_input)

        outputs = [
            httpclient.InferRequestedOutput("embedding"),
        ]

        response = client.infer(encoder_model,
                                inputs,
                                request_id=str(1),
                                outputs=outputs)

        result = response.get_response()
        embeddings = response.as_numpy("embedding")

        print("=" * 80)
        print(result)
        print("SMILES              : ", smis_input)
        print("Raw Content length  : ", len(embeddings[0][0]))

        embeddings = io.BytesIO(embeddings[0][0])
        if format == 'h5':
            conv_embeddings = h5py.File(embeddings)
            conv_embeddings = conv_embeddings['embeddings']
        elif format == 'npz':
            conv_embeddings = np.load(embeddings)['embeddings']

        print("Embedding           : ", conv_embeddings.shape, type(conv_embeddings))
        print("=" * 80)


if __name__ == '__main__':
    test_encoder_n_decoder()

Observed results and expected behavior

Expected behavior:

  • Server returns an array with 4370 elements
  • Client receives an array with 4370 elements

Actual behavior:

  • Server returns an array with 4370 elements
  • Client receives an array with 4366 elements

Environment

  • PyTriton version: 0.2.5
  • Deployment details: Single GPU A6000

Issues inferencing HTTP with Bart model

Hey guys, I just ran across pytriton, looks incredible! It solves a lot of the headaches of scaling with FastAPI...

Excuse my ignorance and being a novice to triton.

I am having a bit of an issue trying to pass in an array for this summarizer model via HTTP POST.
Here's my example trying to load the Bart model; I keep trying to figure out how to pass in shapes, e.g. -1, 1024.

I can see the summarizer work in the responses, but the output is just showing generic responses.

Any tips would be very helpful!

import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton
from pytriton.decorators import batch
import numpy as np


tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

@batch
def _infer_fn(prompt: np.ndarray):
    prompts = np.char.decode(prompt.astype("bytes"), "utf-8")
    prompts = prompts.tolist()

    summaries = []
    for prompt in prompts:
        # Preprocess the prompt for summarization
        encoded_prompt = tokenizer.encode(prompt, return_tensors="pt", padding=True, truncation=True)
        input_ids = encoded_prompt.to(device)

        # Generate the summary
        summary_ids = model.generate(input_ids, num_beams=4, max_length=300, early_stopping=False)
        summary = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
        summaries.append(summary)

    return {"summary": np.char.encode(summaries, "utf-8")}

def main():
    with Triton() as triton:
        triton.bind(
            model_name="text_summarization",
            infer_func=_infer_fn,
            inputs=[
                Tensor(name="prompt", dtype=bytes, shape=(-1,)),
            ],
            outputs=[
                Tensor(name="summary", dtype=bytes, shape=(-1,)),
            ],
            config=ModelConfig(max_batch_size=8),
        )
        triton.serve()

if __name__ == "__main__":
    main()

Passing in via postman:

{
"inputs": [
{
"name": "prompt",
"shape": [3, -1], // Tried many different ways...
"datatype": "BYTES",
"data": ["prompt goes here", "prompt goes here","prompt goes here"]
}
]
}
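
Not a verified fix, but a hedged alternative to hand-building the HTTP payload: the ModelClient shown in the Quick Start derives shapes from the numpy array itself. The prompts and host below are placeholders, and the (3, 1) shape (three samples of one string each) is an assumption that must match the bind() above:

import numpy as np
from pytriton.client import ModelClient

prompts = np.array([["prompt goes here"], ["another prompt"], ["a third prompt"]])
prompts = np.char.encode(prompts, "utf-8")  # dtype bytes, shape (3, 1)

with ModelClient("localhost", "text_summarization") as client:
    result = client.infer_batch(prompt=prompts)
    print(result["summary"])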

PyTriton doesn't natively support PyTorch or TensorFlow dtypes

I was trying to use PyTorch with PyTriton,

nvidia-pytriton           0.2.5                    pypi_0    pypi

I need to pass a tensor to the server, but Python raises the following error:

TypeError: 'torch.dtype' object is not callable

I checked the source code and here is the reason.
In file pytriton/model_config/generator.py:

            if spec_.dtype in [np.object_, object, bytes, np.bytes_]:
                dtype = "TYPE_STRING"
            else:
                # pytype: disable=attribute-error
                dtype = spec_.dtype().dtype
                # pytype: enable=attribute-error 
                dtype = f"TYPE_{client_utils.np_to_triton_dtype(dtype)}"

which indicates that only a limited set of datatypes is supported.
Is there a more convenient way to pass a torch tensor to the server than manually converting it into bytes?
I wish pytriton could support more datatypes in a future version.
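
Until such support exists, the pattern used elsewhere on this page is to declare inputs/outputs with numpy dtypes and convert at the boundary of the callable; a minimal sketch (the model call is a placeholder):

import numpy as np
import torch
from pytriton.decorators import batch


@batch
def infer_fn(data: np.ndarray):
    tensor = torch.from_numpy(data).cuda()   # numpy -> torch on the GPU
    output = tensor * -1                     # placeholder for the real model call
    return {"result": output.cpu().numpy()}  # torch -> numpy for the response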

Support for aarch64

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
I saw in known_issues.md that only x86-64 is currently supported. Is there any official plan to support aarch64 (e.g. AGX Orin) in the near future?
Even if it is not officially supported, is this feasible by changing a few things?

Describe alternatives you've considered
None

Additional context
No

Thank you and keep up the good work :)

Document pros and cons

Is your feature request related to a problem? Please describe.
Lack of documentation on pros and cons of using PyTriton VS Triton backends directly.

Describe the solution you'd like
First, thanks for this library! It fills a gap between writing usual Python code and using Triton.

I would like to find the answers to the following questions in the documentation:

  1. Latency: what is the latency overhead of executing an inference function in PyTriton VS implementing a TritonPythonModel with the same code in the Python backend?
  2. Concurrency: is parallel execution of inference functions limited by the Python GIL? If yes, how can this be overcome?
  3. Local model or backend: what are the pros and cons of, e.g., loading a TensorFlow SavedModel and using it in an inference function VS serving the same model with the Triton TensorFlow backend?
  4. Supported backends: how to add more backends to the built-in Triton, for instance the TensorFlow backend?
  5. Invoke other models: how can an inference function execute inference requests on other models served by Triton, similarly to BLS?
  6. Collocation with other services: is it advised to run other servers (e.g. a JSON-RPC server frontend) or threads (e.g. to reload a data file) in the same "server.py" process?
  7. External services: is it advised to call external APIs (e.g. to retrieve additional features) from an inference function?

Describe alternatives you've considered
Publishing a rationale for why this project was created, along with a roadmap, would answer some of these questions.
Having a forum where such questions can be asked would help too.

Additional context
Thank you in advance!
Feel free to ask if more information is needed.

TypeError: a bytes-like object is required, not 'tuple'

I am receiving the following error when handling multiple requests.

Failed to process the request(s) for model instance 'MODEL_0', message: TritonModelException: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/inference_handler.py", line 124, in run
    output_tensor_infos = self.shm_response_manager.to_shm(
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 259, in to_shm
    return [
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 261, in <listcomp>
    {
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 262, in <dictcomp>
    input_name: self._wrap_array(req_res_index, input_name, req_resp.data[input_name])
  File "/usr/local/lib/python3.8/dist-packages/pytriton/proxy/communication.py", line 310, in _wrap_array
    buf[:required_buffer_size] = serialized_np_array
TypeError: a bytes-like object is required, not 'tuple'

Should this be managed by the --max-batch-size param?
Or am I missing another concurrency-related param when deploying?

This is the Dockerfile:

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.04-py3
ARG BUILD_FROM

FROM ${FROM_IMAGE_NAME} as base
WORKDIR /opt/app

FROM base as install-from-pypi
RUN pip install -U nvidia-pytriton

FROM install-from-pypi AS image

ENV PYTHONUNBUFFERED=1
WORKDIR /opt/app

COPY install.sh /opt/app
RUN /opt/app/install.sh

COPY client.py /opt/app
COPY server.py /opt/app

ENTRYPOINT ["python","server.py"]

Support for DALI and TensorRT

Hi:

Are DALI and TensorRT currently supported? If so, could corresponding examples be provided in the examples directory? Thanks a lot.

How to pass priority level during inference?

I am confused about how to pass the priority value when performing inference.

For example, if I set up the DynamicBatcher with two priority levels:

batcher = DynamicBatcher(
    max_queue_delay_microseconds=1000,
    preserve_ordering=False,
    priority_levels=2,
    default_priority_level=2,
)

When using the client and calling infer_batch or infer_sample, where is the priority supposed to be passed? Looking at the docs, I assumed headers is where you would pass it, so I tried this:

client.infer_batch(..., headers={'priority': 1})

However, that does not work. I couldn't find any examples or more detailed docs anywhere on how priority is supposed to be used. Any help would be appreciated.
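
Not an authoritative answer, but Triton's inference protocol treats priority as a request parameter rather than an HTTP header, so the parameters argument of infer_batch/infer_sample looks like the more likely place; a sketch of that guess, assuming a single-input model and a local server (the address and model name are placeholders):

import numpy as np
from pytriton.client import ModelClient

data = np.zeros((2, 3), dtype=np.float32)  # placeholder batch for a hypothetical single-input model

with ModelClient("localhost:8000", "my_model") as client:
    # Guess: send priority as a Triton request parameter instead of an HTTP header.
    result = client.infer_batch(data, parameters={"priority": 1})

In Triton, lower numbers mean higher priority, so with priority_levels=2 a value of 1 would be the high-priority lane.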

Support Mac installation

It would be great to be able to install pytriton on Macs for ease-of-development. Even with the lack of CUDA support for Macs, being able to develop using only the CPU would be a real time saver.

As a far reaching stretch goal, being able to use mps on Macs would also be great.

As of now, pip install nvidia-pytriton throws an error because it is unable to install cuda-python on my machine.

My Machine:

  • Apple M2 Max
  • 32GB Memory
  • Year - 2023
  • Python - 3.11.6

Stub process 'REGIS_0' is not healthy

Description

Running this server in Docker, it will exit after a while with the error above.

To reproduce

It happens only occasionally and cannot be reproduced 100% of the time.

Environment

  • Python interpreter distribution and version: [conda 4.7.12 with Python 3.8 environment]
  • pip version: [23.1.2]
  • PyTriton version: [0.2.5]
  • Deployment details: [single gpu A10]

Additional context
After the first time, I worked around this by adding "ipc=host" to the docker-compose file, but it still happened occasionally afterwards. Can anyone help me out? Thanks!

Is batch decorator deprecated?

Description

After updating to 0.2.0, an infer_func class with the @batch decorator raises an error:

[error screenshot]

But it works fine without the decorator.
