triton-inference-server / fil_backend

FIL backend for the Triton Inference Server

License: Apache License 2.0


fil_backend's Introduction


Triton Inference Server FIL Backend

Triton is a machine learning inference server for easy and highly optimized deployment of models trained in almost any major framework. This backend specifically facilitates use of tree models in Triton (including models trained with XGBoost, LightGBM, Scikit-Learn, and cuML).

If you want to deploy a tree-based model for optimized real-time or batched inference in production, the FIL backend for Triton will allow you to do just that.

Table of Contents

Usage Information

Contributor Docs

Not sure where to start?

If you aren't sure where to start with this documentation, consider one of the following paths:

I currently use XGBoost/LightGBM or other tree models and am trying to assess if Triton is the right solution for production deployment of my models

  1. Check out the FIL backend's blog post announcement
  2. Make sure your model is supported by looking at the model support section
  3. Look over the introductory example
  4. Try deploying your own model locally by consulting the FAQ notebook.
  5. Check out the main Triton documentation for additional features and helpful tips on deployment (including example Helm charts).

I am familiar with Triton, but I am using it to deploy an XGBoost/LightGBM model for the first time.

  1. Look over the introductory example
  2. Try deploying your own model locally by consulting the FAQ notebook. Note that it includes specific example code for serialization of XGBoost and LightGBM models (a minimal sketch also follows this list).
  3. Review the FAQ notebook's tips for optimizing model performance.
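
For reference, here is a minimal serialization sketch (not from the official docs). It assumes already-trained models: xgb_model (an xgboost.Booster or XGBClassifier) and lgb_model (a lightgbm.Booster); the FAQ notebook has the authoritative version.

# Hypothetical, already-trained models: xgb_model (XGBoost) and lgb_model (LightGBM).
# Save XGBoost in the JSON format used with model_type "xgboost_json";
# the file goes into the model repository directory shown in the Quickstart below.
xgb_model.save_model("model.json")
# Save LightGBM in its text format, used with model_type "lightgbm".
lgb_model.save_model("model.txt")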

I am familiar with Triton and the FIL backend, but I am using it to deploy a Scikit-Learn or cuML tree model for the first time

  1. Look at the section on preparing Scikit-Learn/cuML models for Triton.
  2. Try deploying your model by consulting the FAQ notebook, especially the sections on Scikit-Learn and cuML.

I am a data scientist familiar with tree model training, and I am trying to understand how Triton might be used with my models.

  1. Take a glance at the Triton product page to get a sense of what Triton is used for.
  2. Download and run the introductory example for yourself. If you do not have access to a GPU locally, you can just look over this notebook and then jump to the FAQ notebook which has specific information on CPU-only training and deployment.

I have never worked with tree models before.

  1. Take a look at XGBoost's documentation.
  2. Download and run the introductory example for yourself.
  3. Try deploying your own model locally by consulting the FAQ notebook.

I don't like reading docs.

  1. Look at the Quickstart below
  2. Open the FAQs notebook in a browser.
  3. Try deploying your model. If you get stuck, Ctrl-F for keywords on the FAQ page.

Quickstart: Deploying a tree model in 3 steps

  1. Copy your model into the following directory structure. In this example, we show an XGBoost JSON file, but XGBoost binary files, LightGBM text files, and Treelite checkpoint files are also supported.
model_repository/
├─ example/
│  ├─ 1/
│  │  ├─ model.json
│  ├─ config.pbtxt
  2. Fill out config.pbtxt as follows, replacing $NUM_FEATURES with the number of input features, $MODEL_TYPE with xgboost, xgboost_json, lightgbm or treelite_checkpoint, and $IS_A_CLASSIFIER with true or false depending on whether this is a classifier or regressor.
backend: "fil"
max_batch_size: 32768
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ $NUM_FEATURES ]
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [{ kind: KIND_AUTO }]
parameters [
  {
    key: "model_type"
    value: { string_value: "$MODEL_TYPE" }
  },
  {
    key: "output_class"
    value: { string_value: "$IS_A_CLASSIFIER" }
  }
]

dynamic_batching {}
  3. Start the server:
docker run -p 8000:8000 -p 8001:8001 --gpus all \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.09-py3 \
  tritonserver --model-repository=/models

The Triton server will now be serving your model over both HTTP (port 8000) and GRPC (port 8001) using NVIDIA GPUs if they are available or the CPU if they are not. For information on how to submit inference requests, how to deploy other tree model types, or advanced configuration options, check out the FAQ notebook.
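
As a quick illustration (a sketch, not part of the Triton docs), the following Python snippet submits a request over HTTP using the tritonclient package. It assumes the example model above with 4 input features; adjust the shape and model name to match your configuration.

import numpy as np
import tritonclient.http as triton_http

# Connect to the HTTP endpoint exposed by the container above (port 8000).
client = triton_http.InferenceServerClient(url="localhost:8000")

# One row of input; replace 4 with your model's $NUM_FEATURES.
features = np.random.rand(1, 4).astype(np.float32)
infer_input = triton_http.InferInput("input__0", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)

# Request the configured output tensor and run inference against model "example".
response = client.infer(
    "example",
    inputs=[infer_input],
    outputs=[triton_http.InferRequestedOutput("output__0")],
)
print(response.as_numpy("output__0"))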


fil_backend's Issues

Generic method to (de)serialize Treelite handle

Related: #6 (comment)

Currently, it is not possible to (de)serialize a Treelite handle. Consequently, you'd need to load cuML RF and sklearn tree models as Python objects. We do not want to add a Python dependency to Triton.

Proposal: Add a generic method to (de)serialize a Treelite handle. The handle can represent any of the tree models that Treelite currently supports. This way, we can achieve a good separation between Treelite (which loads the model) and Triton (which consumes the model). The data flow is as follows:

  1. Treelite loads cuML RF or sklearn tree model (Python objects). Treelite is able to construct a C++ object in memory.
  2. Treelite serializes the C++ model object.
  3. Triton deserializes the C++ model object.
  4. Triton performs inference with the model.
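
For illustration only, here is a sketch of steps 1 and 2 in Python, using names from Treelite's later checkpoint API (Treelite 3.x-style; exact names may differ by version):

import treelite

# skl_model is a hypothetical, already-trained scikit-learn tree ensemble
# (e.g. sklearn.ensemble.RandomForestRegressor).
# Step 1: Treelite imports the model and builds its own in-memory representation.
tl_model = treelite.sklearn.import_model(skl_model)

# Step 2: serialize to a checkpoint file that a C++ consumer (here, the Triton
# FIL backend) can deserialize later without any Python dependency.
tl_model.serialize("checkpoint.tl")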

Posting this issue here, since I don't think we want to reference this repo in Treelite's issue tracker yet.

Tests failing on NVIDIA Tesla T4, AWS G4 instance

Tests fail when using an AWS G4 instance.

Steps to reproduce:

  1. Set up CUDA 11.3 (latest) and NVIDIA Docker on a fresh EC2 instance of type g4dn.8xlarge.
  2. Build the triton_fil Docker image: docker build -t triton_fil -f ops/Dockerfile .
  3. Run the CI script: LOCAL=1 ./qa/run_tests.sh

When I switched to a p3.2xlarge instance (V100 GPU), the tests ran successfully.

Error messages:

  • lightgbm model
AssertionError: 
Arrays are not almost equal to 7 decimals
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.36781818
Max relative difference: 0.36781818
 x: array([0.6321818], dtype=float32)
 y: array([1.], dtype=float32)
  • xgboost model
AssertionError:                                                                                                                                                           
Arrays are not almost equal to 7 decimals                                                                                                                                 
Mismatched elements: 1 / 1 (100%)         
Max absolute difference: 2.                                                          
Max relative difference: 1.               
 x: array([0.], dtype=float32)                                                                                                                                            
 y: array([2.], dtype=float32)

Provide example notebook

Create an example notebook demonstrating usage over both HTTP and gRPC, with both XGBoost and LightGBM models.
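
For reference, a minimal gRPC client sketch (an illustration only, assuming a model named example with four input features; the HTTP variant looks the same with tritonclient.http on port 8000):

import numpy as np
import tritonclient.grpc as triton_grpc

# gRPC is served on port 8001 by default.
client = triton_grpc.InferenceServerClient(url="localhost:8001")

features = np.random.rand(1, 4).astype(np.float32)  # adjust to the model's feature count
infer_input = triton_grpc.InferInput("input__0", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)

response = client.infer(
    "example",
    inputs=[infer_input],
    outputs=[triton_grpc.InferRequestedOutput("output__0")],
)
print(response.as_numpy("output__0"))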

Add ability to build triton_fil Docker image with cuML nightly

Customers may want to try out new improvements that are available in nightly builds of cuML, e.g. improvements in FIL or Treelite. It would be great to provide an option to build triton_fil Docker image with a nightly version of cuML.

Currently, triton_fil is built with the latest stable version of cuML.

Model with an incorrect name should throw an error

I put my XGBoost model in the following directory structure:

$ find .
.
./fil
./fil/1
./fil/1/mushroom.model
./fil/config.pbtxt

and started the Triton server with this command:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --gpus=0 --rm -p 8000:8000 \
    -p 8001:8001 -p 8002:8002 -v $PWD:/models triton_fil tritonserver --model-repository=/models

The server silently crashes without displaying any error message:

=============================
== Triton Inference Server ==
=============================
NVIDIA Release 21.04 (build 22449183)
Copyright (c) 2018-2021, NVIDIA CORPORATION.  All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: This container was built for NVIDIA Driver Release 465.19 or later, but
       version 450.51.06 was detected and compatibility mode is UNAVAILABLE.
       [[Forward compatibility was attempted on non supported HW (CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE) cuInit()=804]]
I0504 21:02:54.689217 1 metrics.cc:228] Collecting metrics for GPU 0: Quadro RTX 8000
I0504 21:02:54.948036 1 libtorch.cc:932] TRITONBACKEND_Initialize: pytorch
I0504 21:02:54.948063 1 libtorch.cc:942] Triton TRITONBACKEND API version: 1.0
I0504 21:02:54.948067 1 libtorch.cc:948] 'pytorch' TRITONBACKEND API version: 1.0
2021-05-04 21:02:55.101613: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0504 21:02:55.141980 1 tensorflow.cc:2165] TRITONBACKEND_Initialize: tensorflow
I0504 21:02:55.142029 1 tensorflow.cc:2175] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.142042 1 tensorflow.cc:2181] 'tensorflow' TRITONBACKEND API version: 1.0
I0504 21:02:55.142053 1 tensorflow.cc:2205] backend configuration:
{}
I0504 21:02:55.143758 1 onnxruntime.cc:1722] TRITONBACKEND_Initialize: onnxruntime
I0504 21:02:55.143775 1 onnxruntime.cc:1732] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.143779 1 onnxruntime.cc:1738] 'onnxruntime' TRITONBACKEND API version: 1.0
I0504 21:02:55.159289 1 openvino.cc:1168] TRITONBACKEND_Initialize: openvino
I0504 21:02:55.159303 1 openvino.cc:1178] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.159307 1 openvino.cc:1184] 'openvino' TRITONBACKEND API version: 1.0
I0504 21:02:55.301807 1 pinned_memory_manager.cc:206] Pinned memory pool is created at '0x7f3bd4000000' with size 268435456
I0504 21:02:55.302252 1 cuda_memory_manager.cc:103] CUDA memory pool is created on device 0 with size 67108864
I0504 21:02:55.304084 1 model_repository_manager.cc:1066] loading: fil:1
I0504 21:02:55.464770 1 api.cu:51] TRITONBACKEND_Initialize: fil
I0504 21:02:55.464803 1 triton_utils.cc:56] Triton TRITONBACKEND API version: 1.0
I0504 21:02:55.464808 1 triton_utils.cc:64] 'fil' TRITONBACKEND API version: 1.0
I0504 21:02:55.465248 1 api.cu:77] TRITONBACKEND_ModelInitialize: fil (version 1)
I0504 21:02:55.467424 1 api.cu:126] TRITONBACKEND_ModelInstanceInitialize: fil_0 (GPU device 0)

The server crashed because the model file was incorrectly named; it should be named xgboost.model.

When the model file is incorrectly named, the server should display an appropriate error message rather than crashing silently.

Upgrade to Treelite 1.1.0

Currently, the Triton-FIL backend uses Treelite 1.0.0. We should upgrade to 1.1.0 to take advantage of the following:

  • Support JSON model files from XGBoost 1.4.0
  • Support DART models

Switch to C++17

This will allow us to (among other things) make use of std::optional rather than our hand-rolled solution.

Test end-to-end prediction pipeline with CPU-only machine

Now that CPU prediction is on the docket (#82), we should ensure that the FIL backend can run in an environment with no visible GPU. The CI can test this scenario by launching the test container without the --gpus flag.

For now, I tested #82 locally by manually launching the Docker container with no visible GPU.

Build backend with CMake directly

Currently, the FIL backend is built with the docker build command. It would be nice if it could be built with the cmake/make pattern (similar to the ORT backend), which is what Triton's build.py script expects. It is okay if CMake actually runs docker build when make is invoked (again, the ORT backend is the example here: its CMake configures a Docker image build to prepare the dependencies).

Intermittent failure in cuML RF test with batch size 1

In a manually-triggered CI run last night, we got the following failure in test-custom. The primary build tests passed without issue, so I'm inclined to believe this is just a flaky failure, but the reported result from Triton is obviously nonsense.

Starting tests of model cuml...
Performance statistics for cuml:
Traceback (most recent call last):
  File "/triton_fil/qa/L0_e2e/test_model.py", line 586, in <module>
    run_test(
  File "/triton_fil/qa/L0_e2e/test_model.py", line 461, in run_test
    np.testing.assert_almost_equal(
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 581, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1044, in assert_array_almost_equal
    assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,
  File "/root/miniconda3/envs/triton_test/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 842, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 148.23135
Max relative difference: 1.
 x: array([0.], dtype=float32)
 y: array([148.23135], dtype=float32)
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['bash', '/triton_fil/qa/run_tests.sh']' command failed.  (See above for error)

Add CI

  • Merge #49
  • Hook up GitLab CI
  • Add clang-format check
  • Add PEP8 check

Use RMM for memory allocation

Since raft::allocate will be removed from RAFT shortly, we will need to migrate from it anyway, and our performance currently suffers in certain domains due to repeated allocations. Using RMM should help with performance and provide an alternative to raft::allocate.
