
triton-inference-server / fil_backend


FIL backend for the Triton Inference Server

License: Apache License 2.0

CMake 1.64% C++ 12.76% Dockerfile 0.95% Shell 4.85% Python 5.75% Jupyter Notebook 68.31% C 5.75%

fil_backend's Introduction

Triton Inference Server

📣 vLLM x Triton Meetup at Fort Mason on Sept 9th 4:00 - 9:00 pm

We are excited to announce that we will be hosting our Triton user meetup with the vLLM team at Fort Mason on Sept 9th from 4:00 - 9:00 pm. Join us for this exclusive event where you will learn about the newest vLLM and Triton features, get a glimpse into the roadmaps, and connect with fellow users and the NVIDIA Triton and vLLM teams. Seating is limited and registration confirmation is required to attend - please register here to join the meetup.



[!WARNING]

LATEST RELEASE

You are currently on the main branch which tracks under-development progress towards the next release. The current release is version 2.48.0 and corresponds to the 24.07 container release on NVIDIA GPU Cloud (NGC).

Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference Server delivers optimized performance for many query types, including real time, batched, ensembles, and audio/video streaming. Triton Inference Server is part of NVIDIA AI Enterprise, a software platform that accelerates the data science pipeline and streamlines the development and deployment of production AI.

Major features include:

New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!

Join the Triton and TensorRT community and stay current on the latest product updates, bug fixes, content, best practices, and more. Need enterprise support? NVIDIA global support is available for Triton Inference Server with the NVIDIA AI Enterprise software suite.

Serve a Model in 3 Easy Steps

# Step 1: Create the example model repository
git clone -b r24.07 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 tritonserver --model-repository=/models

# Step 3: Send an inference request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.07-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT

Please read the QuickStart guide for additional information regarding this example. The QuickStart guide also contains an example of how to launch Triton on CPU-only systems. New to Triton and wondering where to get started? Watch the Getting Started video.

Examples and Tutorials

Check out NVIDIA LaunchPad for free access to a set of hands-on labs with Triton Inference Server hosted on NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM, are located in the NVIDIA Deep Learning Examples page on GitHub. The NVIDIA Developer Zone contains additional documentation, presentations, and examples.

Documentation

Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker images.

Using Triton

Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or more models into a model repository. Depending on the type of the model and on what Triton capabilities you want to enable for the model, you may need to create a model configuration for the model.
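As an illustration only (not part of the Triton documentation), the Python sketch below lays out a minimal repository for a single tree model served through the FIL backend and writes a bare-bones config.pbtxt. The model name (fil_demo), the feature count, and the parameter values are placeholders to adapt to your own model.

"""Sketch: create a minimal Triton model repository for a FIL-served model.

Assumptions (adapt to your model): an XGBoost binary model file, the FIL
backend's conventional tensor names input__0/output__0, and NUM_FEATURES
matching the training data.
"""
from pathlib import Path

MODEL_NAME = "fil_demo"   # hypothetical model name
NUM_FEATURES = 32         # assumption: number of input features

CONFIG = f"""
backend: "fil"
max_batch_size: 32768
input [
  {{ name: "input__0", data_type: TYPE_FP32, dims: [ {NUM_FEATURES} ] }}
]
output [
  {{ name: "output__0", data_type: TYPE_FP32, dims: [ 1 ] }}
]
parameters [
  {{ key: "model_type", value: {{ string_value: "xgboost" }} }},
  {{ key: "output_class", value: {{ string_value: "true" }} }}
]
"""

repo = Path("model_repository") / MODEL_NAME
(repo / "1").mkdir(parents=True, exist_ok=True)      # version 1 directory
(repo / "config.pbtxt").write_text(CONFIG.strip() + "\n")
# Copy your trained model file into model_repository/fil_demo/1/
print(f"Wrote {repo / 'config.pbtxt'}")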

Configure and Use Triton Inference Server

Client Support and Examples

A Triton client application sends inference and other requests to Triton. The Python and C++ client libraries provide APIs to simplify this communication.
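For instance, a minimal HTTP client sketch in Python might look like the following. It assumes the tritonclient package (pip install tritonclient[http]) and the hypothetical fil_demo model with input__0/output__0 tensors from the repository sketch above; an equivalent gRPC client only swaps in tritonclient.grpc and port 8001.

import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a locally running Triton instance (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a random batch matching the hypothetical fil_demo model (32 features).
batch = np.random.rand(4, 32).astype(np.float32)

infer_input = httpclient.InferInput("input__0", batch.shape, "FP32")
infer_input.set_data_from_numpy(batch)
requested = httpclient.InferRequestedOutput("output__0")

result = client.infer("fil_demo", inputs=[infer_input], outputs=[requested])
print(result.as_numpy("output__0"))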

Extend Triton

Triton Inference Server's architecture is specifically designed for modularity and flexibility.

Additional Documentation

Contributing

Contributions to Triton Inference Server are more than welcome. To contribute please review the contribution guidelines. If you have a backend, client, example or similar contribution that is not modifying the core of Triton, then you should file a PR in the contrib repo.

Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project. When posting issues in GitHub, follow the process outlined in the Stack Overflow document. Ensure posted examples are:

  • minimal – use as little code as possible that still produces the same problem
  • complete – provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing problems, the more time we have to fix them.
  • verifiable – test the code you're about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community GitHub Discussions.

For more information

Please refer to the NVIDIA Developer Triton page for more information.

fil_backend's People

Contributors

abhisheksawarkar, ahjdzx, aroraakshit, daxiongshu, divyegala, dyastremsky, erikrene, guanluo, hcho3, jfurtek, jmarshall-medallia, kthui, lowener, mc-nv, nealvaidya, rafvasq, ramitchell, tabrizian, viclafargue, wphicks, yanshenchun


fil_backend's Issues

Provide example notebook

Create an example notebook demonstrating usage over both HTTP and gRPC, with both XGBoost and LightGBM models.
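Until such a notebook exists, a minimal gRPC sketch (assuming the tritonclient package and a FIL-served model named fil_demo with input__0/output__0 tensors; a LightGBM model repository would be queried the same way) might look like this. The HTTP variant is identical apart from importing tritonclient.http and pointing at port 8000.

import numpy as np
import tritonclient.grpc as grpcclient  # pip install tritonclient[grpc]

# gRPC endpoint; Triton's default gRPC port is 8001 (HTTP is 8000).
client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(4, 32).astype(np.float32)  # placeholder feature count

infer_input = grpcclient.InferInput("input__0", batch.shape, "FP32")
infer_input.set_data_from_numpy(batch)
requested = grpcclient.InferRequestedOutput("output__0")

result = client.infer("fil_demo", inputs=[infer_input], outputs=[requested])
print(result.as_numpy("output__0"))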

Intermittent failure in cuML RF test with batch size 1

In a manually-triggered CI run last night, we got the following failure in test-custom. The primary build tests passed without issue, so I'm inclined to believe this is just a flaky failure, but the reported result from Triton is obviously nonsense.

Starting tests of model cuml...
Performance statistics for cuml:
Traceback (most recent call last):
  File "/triton_fil/qa/L0_e2e/test_model.py", line 586, in <module>
    run_test(
  File "/triton_fil/qa/L0_e2e/test_model.py", line 461, in run_test
    np.testing.assert_almost_equal(
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 581, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1044, in assert_array_almost_equal
    assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 842, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 148.23135
Max relative difference: 1.
 x: array([0.], dtype=float32)
 y: array([148.23135], dtype=float32)
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['bash', '/triton_fil/qa/run_tests.sh']' command failed.  (See above for error)

Build backend with CMake directly

Currently the FIL backend is built with the docker build command. It would be nice if it could be built with the cmake/make pattern (similar to the ORT backend), which is what Triton's build.py script expects. It is okay if the CMake actually runs docker build when calling make ... (again, the ORT backend is the example here: its CMake configures a Docker image build to prepare the dependencies).

Add CI

  • Merge #49
  • Hook up GitLab CI
  • Add a .clang-format check
  • Add a PEP8 check

Switch to C++17

This will allow us to (among other things) make use of std::optional rather than our hand-rolled solution.

Add ability to build triton_fil Docker image with cuML nightly

Customers may want to try out new improvements that are available in nightly builds of cuML, e.g. improvements in FIL or Treelite. It would be great to provide an option to build triton_fil Docker image with a nightly version of cuML.

Currently, triton_fil is built with the latest stable version of cuML.

Generic method to (de)serialize Treelite handle

Related: #6 (comment)

Currently, it is not possible to (de)serialize a Treelite handle. Consequently, you'd need to load cuML RF and sklearn tree models as Python objects. We do not want to add a Python dependency to Triton.

Proposal: Add a generic method to (de)serialize a Treelite handle. The handle can represent any of the tree models that Treelite currently supports. This way, we can achieve a good separation between Treelite (which loads the model) and Triton (which consumes the model). The data flow is as follows (see the sketch after this list):

  1. Treelite loads cuML RF or sklearn tree model (Python objects). Treelite is able to construct a C++ object in memory.
  2. Treelite serializes the C++ model object.
  3. Triton deserializes the C++ model object.
  4. Triton performs inference with the model.
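A rough Python sketch of that flow, for illustration only: the sklearn import call is existing Treelite API, but serialize_to_checkpoint is a hypothetical name for the generic method this issue proposes, so it is left as a comment.

"""Sketch of the proposed flow (step numbers match the list above)."""
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import treelite

X, y = make_classification(n_samples=1000, n_features=32, random_state=0)
clf = RandomForestClassifier(n_estimators=10).fit(X, y)

# Step 1: Treelite builds an in-memory model object from the Python model.
tl_model = treelite.sklearn.import_model(clf)

# Step 2 (proposed): serialize the in-memory model to a generic checkpoint.
# tl_model.serialize_to_checkpoint("checkpoint.tl")  # hypothetical API from this proposal

# Steps 3-4 would happen inside Triton: the FIL backend deserializes the
# checkpoint and performs inference without any Python dependency.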

Posting this issue here, since I don't think we want to reference this repo in Treelite's issue tracker yet.

Test end-to-end prediction pipeline with CPU-only machine

Now that CPU prediction is on the docket (#82), we should ensure that the FIL backend can run in an environment with no visible GPU. The CI can test this scenario by launching the test container without the --gpus flag.

For now, I tested #82 locally by manually launching the Docker container with no visible GPU.

Model with an incorrect name should throw an error

I put my XGBoost model in the following directory structure:

$ find .
.
./fil
./fil/1
./fil/1/mushroom.model
./fil/config.pbtxt

and started the Triton server with this command

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --gpus=0 --rm -p 8000:8000 \
    -p 8001:8001 -p 8002:8002 -v $PWD:/models triton_fil tritonserver --model-repository=/models

The server silently crashes without displaying any error message:

=============================
== Triton Inference Server ==
=============================
NVIDIA Release 21.04 (build 22449183)
Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but
       version 450.51.06 was detected and compatibility mode is UNAVAILABLE.
       [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
I0504 21:02:54.689217 1 metrics.cc:228] Collecting metrics for GPU 0: Quadro RTX 8000
I0504 21:02:54.948036 1 libtorch.cc:932] TRITONBACKEND_Initialize: pytorch
I0504 21:02:54.948063 1 libtorch.cc:942] Triton TRITONBACKEND API version: 1.0
I0504 21:02:54.948067 1 libtorch.cc:948] 'pytorch' TRITONBACKEND API version: 1.0
2021-05-04 21:02:55.101613: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0504 21:02:55.141980 1 tensorflow.cc:2165] TRITONBACKEND_Initialize: tensorflow
I0504 21:02:55.142029 1 tensorflow.cc:2175] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.142042 1 tensorflow.cc:2181] 'tensorflow' TRITONBACKEND API version: 1.0
I0504 21:02:55.142053 1 tensorflow.cc:2205] backend configuration:
{}
I0504 21:02:55.143758 1 onnxruntime.cc:1722] TRITONBACKEND_Initialize: onnxruntime
I0504 21:02:55.143775 1 onnxruntime.cc:1732] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.143779 1 onnxruntime.cc:1738] 'onnxruntime' TRITONBACKEND API version: 1.0
I0504 21:02:55.159289 1 openvino.cc:1168] TRITONBACKEND_Initialize: openvino
I0504 21:02:55.159303 1 openvino.cc:1178] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.159307 1 openvino.cc:1184] 'openvino' TRITONBACKEND API version: 1.0
I0504 21:02:55.301807 1 pinned_memory_manager.cc:206] Pinned memory pool is created at '0x7f3bd4000000' with size 268435456
I0504 21:02:55.302252 1 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I0504 21:02:55.304084 1 model_repository_manager.cc:1066] loading: fil:1
I0504 21:02:55.464770 1 api.cu:51] TRITONBACKEND_Initialize: fil
I0504 21:02:55.464803 1 triton_utils.cc:56] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.464808 1 triton_utils.cc:64] 'fil' TRITONBACKEND API version: 1.0
I0504 21:02:55.465248 1 api.cu:77] TRITONBACKEND_ModelInitialize: fil (version 1)
I0504 21:02:55.467424 1 api.cu:126] TRITONBACKEND_ModelInstanceInitialize: fil_0 (GPU device 0)

The server crashed because the model file was incorrectly named; it should be named xgboost.model.

When the model file is incorrectly named, the server should display an appropriate error message.
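As a stopgap, one could sanity-check the repository before launching the server. The sketch below is illustrative only: "xgboost.model" is the name this issue calls for, while the other entries in EXPECTED are assumptions about the backend's default file names.

"""Sketch: warn about unexpected model file names in a FIL model repository."""
import sys
from pathlib import Path

# "xgboost.model" comes from this issue; the remaining names are assumed defaults.
EXPECTED = {"xgboost.model", "xgboost.json", "model.txt", "checkpoint.tl"}

def check_repository(repo: str) -> bool:
    ok = True
    for version_dir in Path(repo).glob("*/*"):
        if not (version_dir.is_dir() and version_dir.name.isdigit()):
            continue  # only numeric version directories hold model files
        files = {p.name for p in version_dir.iterdir() if p.is_file()}
        if not files & EXPECTED:
            print(f"{version_dir}: no recognized model file (found {sorted(files)})")
            ok = False
    return ok

if __name__ == "__main__":
    repo_path = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.exit(0 if check_repository(repo_path) else 1)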

Tests failing on NVIDIA Tesla T4, AWS G4 instance

Tests fail when using AWS G4 instance.

Steps to reproduce:

  1. Set up CUDA 11.3 (latest) and NVIDIA Docker on a fresh EC2 instance of type g4dn.8xlarge.
  2. Build the triton_fil Docker image: docker build -t triton_fil -f ops/Dockerfile .
  3. Run the CI script: LOCAL=1 ./qa/run_tests.sh

When I switched the instance to the p3.2xlarge type (V100 GPU), the tests ran successfully.

Error messages:

  • lightgbm model
AssertionError: 
Arrays are not almost equal to 7 decimals
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.36781818
Max relative difference: 0.36781818
 x: array([0.6321818], dtype=float32)
 y: array([1.], dtype=float32)
  • xgboost model
AssertionError:
Arrays are not almost equal to 7 decimals
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 2.
Max relative difference: 1.
 x: array([0.], dtype=float32)
 y: array([2.], dtype=float32)

Use RMM for memory allocation

Since raft::allocate will be removed from RAFT shortly, we will need to migrate from it anyway, and our performance currently suffers in certain domains due to repeated allocations. Using RMM should help with performance and provide an alternative to raft::allocate.

Upgrade to Treelite 1.1.0

Currently, the Triton-FIL backend uses Treelite 1.0.0. We should upgrade to 1.1.0 to take advantage of the following (a short sketch follows the list):

  • Support JSON model files from XGBoost 1.4.0
  • Support DART models
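For example, loading an XGBoost JSON model through Treelite would look roughly like this (a sketch, assuming Treelite 1.1.0 and XGBoost >= 1.4 are installed; file names are placeholders):

import treelite
import xgboost as xgb
from sklearn.datasets import make_regression

# Train a small model and save it in XGBoost's JSON format.
X, y = make_regression(n_samples=500, n_features=16, random_state=0)
bst = xgb.XGBRegressor(n_estimators=10).fit(X, y)
bst.save_model("model.json")  # the .json suffix selects the JSON format

# With Treelite 1.1.0, the JSON file can be loaded via the xgboost_json frontend.
tl_model = treelite.Model.load("model.json", model_format="xgboost_json")
print("Loaded Treelite model from model.json")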
