
hugectr_backend's Introduction

License

Hierarchical Parameter Server Backend

The Hierarchical Parameter Server (HPS) Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and the embedding cache from the end-to-end inference pipeline of the deep recommendation model. The HPS Backend supports executing multiple embedding lookup services concurrently across multiple GPUs through an embedding cache that is shared between multiple lookup sessions. For more information, see Hierarchical Parameter Server Architecture.

Quick Start

You can build the HPS Backend from scratch and install it to a path of your choice based on your specific requirements using the NGC Merlin inference Docker images.

We support the following compute capabilities for inference deployment:

Compute Capability GPU SM
7.0 NVIDIA V100 (Volta) 70
7.5 NVIDIA T4 (Turing) 75
8.0 NVIDIA A100 (Ampere) 80
8.6 NVIDIA A10 (Ampere) 86
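
If you are not sure which compute capability your GPUs provide, recent NVIDIA drivers can report it directly through nvidia-smi. The following query is a minimal sketch and assumes a driver whose nvidia-smi supports the compute_cap query field:

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# Example output (values depend on your hardware):
# name, compute_cap
# Tesla V100-SXM2-16GB, 7.0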

The following prerequisites must be met before installing or building the HugeCTR Backend from scratch:

  • Docker version 19 and higher
  • cuBLAS version 10.1
  • CMake version 3.17.0
  • cuDNN version 7.5
  • RMM version 0.16
  • GCC version 7.4.0

Install the HPS Backend Using NGC Containers

All NVIDIA Merlin components are available as open-source projects. However, a more convenient way to make use of these components is by using Merlin NGC containers. These NGC containers allow you to package your software application, libraries, dependencies, and runtime compilers in a self-contained environment. When you install the HPS Backend using NGC containers, the application environment remains portable, consistent, reproducible, and agnostic to the underlying host system software configuration. The HPS Backend container has the necessary libraries and header files pre-installed, and you can directly deploy the HPS models to production.

Docker images for the HPS Backend are available in the NVIDIA container repository at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-hugectr. You can pull and launch the container by running the following command:

docker run --gpus=1 --rm -it nvcr.io/nvidia/merlin/merlin-hugectr:23.09 # Start in interactive mode

NOTE: The HPS backend is derived from the HugeCTR backend. As of HugeCTR version 3.0, the HugeCTR container is no longer released separately. If you're an advanced user, you can use the unified Merlin container to build the HugeCTR training or inference Docker image from scratch based on your own specific requirements. You can obtain the unified Merlin container by logging into NGC.

Build the HPS Backend from Scratch

Before building the HPS inference backend from scratch, you must first verify that the HugeCTR inference shared library (libhuge_ctr_inference.so) has been compiled. Then you can generate the HPS shared library (libtriton_hps.so) and copy it to the HugeCTR/HPS default path to complete the backend build. The default path where all HugeCTR and HPS Backend libraries and header files are installed is /usr/local/hugectr.

  1. Build the HugeCTR inference shared library from scratch. Download the HugeCTR repository and the third-party modules that it relies on by running the following commands:

    $ git clone https://github.com/NVIDIA/HugeCTR.git
    $ cd HugeCTR
    $ git submodule update --init --recursive
    

    Build the HugeCTR inference shared library:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Release -DSM="70;80" -DENABLE_INFERENCE=ON ..
    $ make -j && make install
    

    For more information, see Build HPS from Source. After compiling, you can find the libhuge_ctr_hps.so file in the path /usr/local/hugectr/lib.

  2. Build the HPS inference backend. Download the HPS Backend repository by running the following commands:

    $ git clone https://github.com/triton-inference-server/hugectr_backend.git
    $ cd hugectr_backend/hps_backend
    

    Use CMake to build and install the HPS Backend as follows:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_COMMON_REPO_TAG=<rxx.yy>  -DTRITON_CORE_REPO_TAG=<rxx.yy> -DTRITON_BACKEND_REPO_TAG=<rxx.yy> ..
    $ make install
    $ ls             # check your compiled shared library (libtriton_hps.so)
    

    NOTE: <rxx.yy> is the "release version" of Triton that you want to deploy, such as r23.06. You can use the tritonserver command to confirm your current "server_version" and then find the corresponding "release version" in the Triton release notes. For example, the r23.06 release corresponds to Triton "server_version" 2.35.0.

    Option Value
    server_id triton
    server_version 2.35.0
    release version r23.06
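
    For example, one way to confirm the "server_version" of a running Triton instance is to query its server metadata endpoint. This is a minimal sketch that assumes Triton's HTTP service is listening on the default port 8000:

    $ curl -s localhost:8000/v2
    # The JSON response includes "name", "version" (the server_version, e.g. "2.35.0"),
    # and the list of supported "extensions".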
  3. Copy the compiled shared library (libtriton_hps.so) to your specified HPS default path. Remember to specify the absolute path of the local directory where the HPS Backend is installed via the --backend-directory argument when launching the Triton server. For example, if you copy it to the /usr/local/hugectr/backends/hps folder, the sample command to start tritonserver would be:

    $ tritonserver --model-repository=/path/to/model_repo/ --load-model=model_name \
     --model-control-mode=explicit \
     --backend-directory=/usr/local/hugectr/backends \
     --backend-config=hps,ps=/path/to/model_repo/hps.json
    

    The following Triton repositories, which are required, will be pulled and used in the build. By default, the "main" branch/tag will be used for each repository. However, the following cmake arguments can be used to override the "main" branch/tag:

    • triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
    • triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
    • triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

    For more information, see Triton example backends and Triton backend shared library.

Independent Inference Hierarchical Parameter Server Configuration

The HPS Backend configuration file is essentially the same as the HugeCTR inference Parameter Server configuration format, with some new configuration items added for the HPS Backend. In particular, the per-model configuration of multiple embedding tables avoids passing too many command-line parameters and allows reasonable memory pre-allocation when launching the Triton server.

To deploy an embedding table on the HPS Backend, a few customized configuration items need to be added as follows. The HPS Backend configuration file must be formatted as JSON.

NOTE: The models clause needs to be included as a list, with the specific configuration of each model as an item. sparse_files can be filled with multiple embedding table paths to support multiple embedding tables per model.

{
    "supportlonglong": true,
    "volatile_db": {
        "type": "hash_map",
        "user_name": "default",
        "num_partitions": 8,
        "max_batch_size": 100000,
        "overflow_policy": "evict_random",
        "overflow_margin": 10000000,
        "overflow_resolution_target": 0.8,
        "initial_cache_rate": 1.0
    },
    "persistent_db": {
        "type": "disabled"
    },
    "models": [{
        "model": "hps_wdl",
        "sparse_files": ["/hps_infer/embedding/hps_wdl/1/wdl0_sparse_2000.model", "/hps_infer/embedding/hps_wdl/1/wdl1_sparse_2000.model"],
        "num_of_worker_buffer_in_pool": 3,
        "embedding_table_names":["embedding_table1","embedding_table2"],
        "embedding_vecsize_per_table":[1,16],
        "maxnum_catfeature_query_per_table_per_sample":[2,26],
        "default_value_for_each_table":[0.0,0.0],
        "deployed_device_list":[0],
        "max_batch_size":1024,
        "hit_rate_threshold":0.9,
        "gpucacheper":0.5,
        "gpucache":true
        }
    ]
}

Model Repository Extension

Since the HPS Backend is a customizable Triton component, it is capable of supporting the Model Repository Extension. Triton's Model Repository Extension allows you to query and control model repositories being served by Triton. The "model_repository" extension is reported in the extensions field of the server metadata. For more information, see Model Repository Extension.

The HPS Backend is fully compatible with Triton's EXPLICIT model control mode. After adding the configuration of a new model to the HPS configuration file, new models can be deployed online through Triton's load API, and old models can be recycled online through the unload API.
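
When Triton runs in EXPLICIT model control mode, these load and unload operations can be triggered through the model repository endpoints of Triton's HTTP API. The commands below are a minimal sketch; hps_wdl is the example model name from the configuration above, and the default HTTP port 8000 is assumed:

curl -X POST localhost:8000/v2/repository/models/hps_wdl/load     # deploy a new model online
curl -X POST localhost:8000/v2/repository/models/hps_wdl/unload   # recycle an old model online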

The following should be noted when using Model Repository Extension functions:

  • Deploy new models online: The load API not only loads the network dense weights as part of the HugeCTR model, but also inserts the embedding tables of the new models into the Hierarchical Inference Parameter Server and creates the embedding caches based on the model definition in the independent Parameter Server configuration. In other words, the Parameter Server independently provides an initialization mechanism for the embedding tables and embedding caches of new models.

Note: If you use the HPS inference online update, add the freeze_sparse option (default: false) in the Triton configuration file (config.pbtxt) to avoid the embedding table being updated repeatedly.

parameters:[
   ...
 {
 key: "freeze_sparse"
 value: { string_value: "true" }
 }
   ...
]

Metrics

Triton provides Prometheus metrics indicating GPU and request statistics. Use Prometheus to gather metrics into usable, actionable entries, giving you the data you need to manage alerts and performance information in your environment. Prometheus is usually used alongside Grafana, a visualization tool that pulls Prometheus metrics and makes them easier to monitor. You can build your own metrics system based on our example; see HPS Backend Metrics.
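
For a quick look at the raw data before wiring Prometheus and Grafana together, you can scrape Triton's metrics endpoint directly. This is a minimal sketch that assumes the default metrics port 8002 has not been changed:

curl -s localhost:8002/metrics
# Returns Prometheus-format counters and gauges (for example, inference request counts
# and GPU utilization) that a Prometheus server can scrape on a fixed interval.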

HPS Inference Hierarchical Parameter Server

The HPS inference Hierarchical Parameter Server implements a hierarchical storage mechanism between local SSDs and CPU memory, which breaks the convention that the embedding table must be stored entirely in local CPU memory. The volatile database layer allows utilizing Redis cluster deployments to store and retrieve embeddings in/from the RAM available in your cluster. The persistent database layer links HPS with a persistent database: each node that has such a persistent storage layer configured retains a separate copy of all embeddings in its locally available non-volatile memory. See Distributed Deployment and Hierarchical Parameter Server for more details.
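
As an illustration of such a layered deployment, the storage layers are selected through the volatile_db and persistent_db clauses of the HPS configuration file. The fragment below is a hedged sketch only: the Redis addresses, the RocksDB path, and the tuning values are placeholders, and the exact option names (for example the persistent database type string) should be verified against the Hierarchical Parameter Server documentation for your HugeCTR version.

{
    "supportlonglong": true,
    "volatile_db": {
        "type": "redis_cluster",
        "address": "127.0.0.1:7000,127.0.0.1:7001,127.0.0.1:7002",
        "user_name": "default",
        "num_partitions": 8,
        "overflow_margin": 10000000,
        "overflow_policy": "evict_oldest",
        "initial_cache_rate": 1.0
    },
    "persistent_db": {
        "type": "rocks_db",
        "path": "/hps/rocksdb"
    },
    "models": [
        ...
    ]
}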

In the following table, we provide an overview of the typical properties of the different parameter database layers (and the embedding cache). We emphasize that this table is only intended to provide a rough orientation; properties of actual deployments may deviate.

| | GPU Embedding Cache | CPU Memory Database | Distributed Database (InfiniBand) | Distributed Database (Ethernet) | Persistent Database |
|---|---|---|---|---|---|
| Mean Latency | ns ~ us | us ~ ms | us ~ ms | several ms | ms ~ s |
| Capacity (relative) | ++ | +++ | +++++ | +++++ | +++++++ |
| Capacity (range in practice) | 10 GBs ~ few TBs | 100 GBs ~ several TBs | several TBs | several TBs | up to 100s of TBs |
| Cost / Capacity | ++++ | +++ | ++++ | ++++ | + |
| Volatile | yes | yes | configuration dependent | configuration dependent | no |
| Configuration / maintenance complexity | low | low | high | high | low |
  • Embedding Cache Asynchronous Refresh Mechanism

We support asynchronous refreshing of incremental embedding keys into the embedding cache. The refresh operation is triggered when the sparse model files need to be updated into the GPU embedding cache. After completing a model version iteration or an incremental parameter update based on online training, the latest embedding table needs to be updated into the embedding cache on the inference server. To ensure that a running model can be updated online, we update the distributed database and persistent database through the distributed event streaming platform (Kafka). At the same time, the GPU embedding cache refreshes the values of the existing embedding keys and replaces them with the latest incremental embedding vectors.

  • Embedding Cache Asynchronous Insertion Mechanism

We support asynchronous insertion of missing embedding keys into the embedding cache. This feature is activated automatically through a user-defined hit rate threshold in the configuration file. When the actual hit rate of the embedding cache is higher than the user-defined threshold, the embedding cache inserts missing keys asynchronously; otherwise, they are still inserted synchronously to ensure high accuracy of inference requests. Compared with the previous synchronous method, asynchronous insertion can further improve the actual hit rate of the embedding cache once it reaches the user-defined threshold.

  • Performance Optimization of Inference Parameter Server

We have added support for multiple database interfaces to our inference parameter server. In particular, we added an "in memory" database that utilizes local CPU memory for storing and recalling embeddings and uses multi-threading to accelerate lookup and storage.
Further, we revised support for "distributed" storage of embeddings in a Redis cluster. This way, you can use the combined CPU-accessible memory of your cluster for storing embeddings. The new implementation is up to two orders of magnitude faster than the previous one.
Further, we performance-optimized support for "persistent" storage and retrieval of embeddings via RocksDB through the structured use of column families. Creating a hierarchical storage (i.e., using Redis as a distributed cache and RocksDB as a fallback) is supported as well. These advantages come to end users for free, as there is no need to adjust the PS configuration.

Hierarchical Parameter Server Online Update

If an incremental update has been applied to some embedding table entries, either during online training (frequent/incremental updates) or after completing an offline training, the latest versions of the updated embeddings have to be propagated to all inference nodes. Our HPS achieves this using a dedicated online updating mechanism. The blue data-flow graph in the figure below illustrates this process. First, the training nodes dump their updates to an Apache Kafka-based message buffer. This is done via our Message Producer API, which handles serialization, batching, and the organization of updates into distinct message queues for each embedding table. Inference nodes that have loaded the affected model can use the corresponding Message Source API to discover and subscribe to these message queues. Received updates are then applied to the respective local VDB shards and the PDB. The GPU embedding cache polls its associated VDB/PDB for updates and replaces embeddings if necessary. This refresh cycle is configurable to best fit the training schedule. When using online training, the GPU embedding cache periodically (e.g., every $n$ minutes or hours) scans for updates and refreshes its contents. During offline training, poll cycles are triggered by the Triton model management API.

Fig. 1. HPS Inference Online Update
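
On the inference side, HPS discovers these Kafka message queues through the parameter server configuration. The fragment below is a hedged sketch: the update_source clause and the broker addresses are illustrative placeholders, so verify the exact option names against the Hierarchical Parameter Server documentation for your version.

{
    "supportlonglong": true,
    "update_source": {
        "type": "kafka_message_queue",
        "brokers": "kafka-node1:9092,kafka-node2:9092"
    },
    ...
}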

hugectr_backend's People

Contributors

bashimao · emmaqiaoch · guanluo · jershi425 · kingsleyliu-nv · shijieliu · yingcanw · zehuanw


hugectr_backend's Issues

[BUG] support google cloud storage path in hugeCTR backend

When serving models in the cloud, users typically leverage a cloud storage path like gs://GCS_bucket/models/deepfm/1/deepfm0_sparse_0.model

For example, the config.pbtxt will contain

parameters {
  key: "config"
  value {
    string_value: "gs://[bucket_name]/models/deepfm/1/deepfm.json"
  }
}

The current hugeCTR backend reports: I1030 21:59:03.044715 167 hugectr.cc:282] Fail to open Parameter Server Configuration, please check whether the file path is correct

[BUG] HugeCTR backend 3.4.1 has crashed

The Triton server with the hugectr_backend suddenly died with a segmentation fault after 45 hours while processing repeated mild requests. CPU usage was only 18.6% and memory usage was only 18.8%. There was no CPU usage spike during normal request processing, but there was one spike before the server died, leaving a large 75 GB coredump file.

It repeatedly sent 500,000 benchmark records and used only 10 request threads.

Screenshot 2022-04-29 10:43:38 AM

Screenshot 2022-04-29 10:43:49 AM

top - 19:29:02 up 184 days, 23:44,  0 users,  load average: 11.37, 10.93, 10.87
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s): 16.4 us,  3.1 sy,  0.0 ni, 80.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 386626.8 total, 242976.1 free,  46327.4 used,  97323.2 buff/cache
MiB Swap:   4096.2 total,   1095.7 free,   3000.6 used. 302357.5 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 14365 root      20   0  352.3g  70.9g  33.2g S 312.7  18.8   6139:48 tritonserver
     1 root      20   0    5672    216      0 S   0.0   0.0   0:00.08 bash
  5409 root      20   0    5676   2580   1932 S   0.0   0.0   0:00.06 bash
 14357 root      20   0    5460      0      0 S   0.0   0.0   0:00.00 run_server.sh
 33886 root      20   0    7564   3664   3108 R   0.0   0.0   0:00.00 top
  • version & server info
  • server : V100 8-GPU server with 384GB RAM
  • version
    • container : nvcr.io/nvidia/merlin/merlin-inference:22.03
    • hugectr_backend : v3.4.1
Signal (11) received.
 0# 0x000055B2063FC299 in tritonserver
 1# 0x00007F07EBCC20C0 in /lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F07EBE0A96A in /lib/x86_64-linux-gnu/libc.so.6
 3# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
 4# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
 5# 0x00007F07E044929D in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 6# TRITONBACKEND_ModelInstanceExecute in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 7# 0x00007F07EC866F9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007F07EC8676B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007F07EC6FF1A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F07EC861527 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F07EC0B3DE4 in /lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F07EC530609 in /lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /lib/x86_64-linux-gnu/libc.so.6

./run_server.sh: line 11: 14365 Segmentation fault      (core dumped) tritonserver --model-repository=/naver/models/ --load-model=meb --model-control-mode=explicit --backend-directory=/usr/local/hugectr/backends --backend-config=hugectr,ps=/naver/models/meb/ps.json --log-info=false --log-verbose=0

server config

  • ps.json
{
    "supportlonglong":"true",
    "volatile_db": {
        "type":"parallel_hash_map",
        "initial_cache_rate":1.0,
        "overflow_margin":120000000,
        "max_get_batch_size": 100000,
        "max_set_batch_size": 100000
    },
    "models":[
        {
            "model":"meb",
            "supportlonglong":true,
            "num_of_worker_buffer_in_pool":"4",
                "num_of_refresher_buffer_in_pool":"1",
                "deployed_device_list":[0, 1, 2, 3, 4, 5, 6, 7],
                "max_batch_size":100,
                "default_value_for_each_table":[0.0],
                "hit_rate_threshold":"0.8",
                "gpucacheper":"1.0",
                "gpucache":"true",
                "cache_refresh_percentage_per_iteration":0.2,
                    "sparse_files":["/naver/models/video-meb-bmtest-v3/1/meb0_sparse_0.model"],
            "dense_file":"/naver/models/video-meb-bmtest-v3/1/meb_dense_0.model",
            "network_file":"/naver/models/video-meb-bmtest-v3/1/meb.json"
        }
    ]
}
  • config.pbtxt
name: "meb"
backend: "hugectr"
max_batch_size:100,
input [
   {
    name: "DES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "CATCOLUMN"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "ROWINDEX"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 4
    kind : KIND_GPU
  }
]

parameters [
  {
  key: "config"
  value: { string_value: "/naver/models/video-meb-bmtest-v3/1/meb.json" }
  },
  {
  key: "gpucache"
  value: { string_value: "true" }
  },
  {
  key: "hit_rate_threshold"
  value: { string_value: "0.8" }
  },
  {
  key: "gpucacheper"
  value: { string_value: "1.0" }
  },
  {
  key: "label_dim"
  value: { string_value: "1" }
  },
  {
  key: "slots"
  value: { string_value: "15" }
  },
  {
  key: "cat_feature_num"
  value: { string_value: "8000" }
  },
  {
  key: "des_feature_num"
  value: { string_value: "0" }
  },
  {
  key: "max_nnz"
  value: { string_value: "8000" }
  },
  {
  key: "embedding_vector_size"
  value: { string_value: "15" }
  },
  {
  key: "embeddingkey_long_type"
  value: { string_value: "true" }
  }
]

Inference error when using the Triton HTTP client

I1125 01:05:26.620924 657 hugectr.cc:1107] The model origin json configuration file path is: /data/zxt/t3/dcn/t3.json
[HUGECTR][01:05:26][INFO][RANK0]: Global seed is 2802872729
[HUGECTR][01:05:27][WARNING][RANK0]: Peer-to-peer access cannot be fully enabled.
[HUGECTR][01:05:27][INFO][RANK0]: Start all2all warmup
[HUGECTR][01:05:27][INFO][RANK0]: End all2all warmup
[HUGECTR][01:05:27][INFO][RANK0]: Use mixed precision: 0
[HUGECTR][01:05:27][INFO][RANK0]: start create embedding for inference
[HUGECTR][01:05:27][INFO][RANK0]: sparse_input name data1
[HUGECTR][01:05:27][INFO][RANK0]: create embedding for inference success
[HUGECTR][01:05:27][INFO][RANK0]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
I1125 01:05:38.177670 657 hugectr.cc:1110] ******Loading HugeCTR model successfully
I1125 01:05:38.177843 657 model_repository_manager.cc:1183] successfully loaded 'dcn_t3' version 1
I1125 01:05:42.905847 657 model_repository_manager.cc:1022] loading: dcn_t3_ens:1
I1125 01:05:43.006289 657 model_repository_manager.cc:1183] successfully loaded 'dcn_t3_ens' version 1
1125 01:07:03.109141 684 pb_stub.cc:402] Failed to process the request(s) for model 'dcn_t3_nvt_0', message: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

At:
/root/.local/lib/python3.8/site-packages/pandas-1.3.4-py3.8-linux-x86_64.egg/pandas/core/generic.py(1537): nonzero
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(266): _transform_tensors
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(265): _transform_tensors
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(143): execute

1125 01:10:03.070768 684 pb_stub.cc:402] Failed to process the request(s) for model 'dcn_t3_nvt_0', message: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

At:
/root/.local/lib/python3.8/site-packages/pandas-1.3.4-py3.8-linux-x86_64.egg/pandas/core/generic.py(1537): nonzero

[BUG] big CATCOLUMN, ROWINDEX server crash

Hello. I am using merlin-inference:22.03.

When I test the hugectr backend with a small batch size, the CATCOLUMN/ROWINDEX requests work well, but when I use a larger batch size such as 50~100, the server aborts with signal (6) or signal (11).

I0321 08:59:58.537734 1 infer_request.cc:675] prepared: [0x0x7fcf600059e0] request id: 1, model: meb, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcf60006d28] input: ROWINDEX, type: INT32, original shape: [1,1351], batch + shape: [1,1351], shape: [1351]
[0x0x7fcf60007448] input: CATCOLUMN, type: INT64, original shape: [1,1916], batch + shape: [1,1916], shape: [1916]
[0x0x7fcf60007b78] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
override inputs:
inputs:
[0x0x7fcf60007b78] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
[0x0x7fcf60007448] input: CATCOLUMN, type: INT64, original shape: [1,1916], batch + shape: [1,1916], shape: [1916]
[0x0x7fcf60006d28] input: ROWINDEX, type: INT32, original shape: [1,1351], batch + shape: [1,1351], shape: [1351]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I0321 08:59:58.537824 1 hugectr.cc:1988] model meb, instance meb, executing 1 requests
I0321 08:59:58.537856 1 hugectr.cc:2056] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I0321 08:59:58.537878 1 hugectr.cc:2157]        input CATCOLUMN: datatype = INT64, shape = [1,1916], byte_size = 15328, buffer_count = 1
I0321 08:59:58.537888 1 hugectr.cc:2169]        input ROWINDEX: datatype = INT32, shape = [1,1351], byte_size = 5404, buffer_count = 1
I0321 08:59:58.537896 1 hugectr.cc:2181]        input DES: datatype = FP32, shape = [1,0], byte_size = 0, buffer_count = 0
I0321 08:59:58.537904 1 hugectr.cc:2206]        requested_output OUTPUT0
I0321 08:59:58.537912 1 infer_response.cc:166] add response output: output: OUTPUT0, type: FP32, shape: [90]
I0321 08:59:58.537925 1 http_server.cc:1068] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I0321 08:59:58.537939 1 http_server.cc:1088] HTTP using buffer for: 'OUTPUT0', size: 360, addr: 0x7f7fcc2c5020
I0321 08:59:58.538025 1 hugectr.cc:2372] *****Processing request on device***** 0 for model meb
Signal (11) received.
 0# 0x000055A64FB47299 in tritonserver
 1# 0x00007FE4C8D42210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007FE4C8E8A959 in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
 4# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
 5# 0x00007FE4AC4C629D in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 6# TRITONBACKEND_ModelInstanceExecute in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 7# 0x00007FE4C98E4F9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007FE4C98E56B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007FE4C977D1A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007FE4C98DF527 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007FE4C9130DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007FE4C95AE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
[0x0x7f32f000b348] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
[0x0x7f32f0023ab8] input: CATCOLUMN, type: INT64, original shape: [1,2949], batch + shape: [1,2949], shape: [2949]
[0x0x7f32f0023c38] input: ROWINDEX, type: INT32, original shape: [1,751], batch + shape: [1,751], shape: [751]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I0324 04:46:46.684514 1 hugectr.cc:1988] model meb, instance meb, executing 1 requests
I0324 04:46:46.684549 1 hugectr.cc:2056] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I0324 04:46:46.684573 1 hugectr.cc:2157]        input CATCOLUMN: datatype = INT64, shape = [1,2949], byte_size = 23592, buffer_count = 1
I0324 04:46:46.684581 1 hugectr.cc:2169]        input ROWINDEX: datatype = INT32, shape = [1,751], byte_size = 3004, buffer_count = 1
I0324 04:46:46.684588 1 hugectr.cc:2181]        input DES: datatype = FP32, shape = [1,0], byte_size = 0, buffer_count = 0
I0324 04:46:46.684595 1 hugectr.cc:2206]        requested_output OUTPUT0
I0324 04:46:46.684605 1 infer_response.cc:166] add response output: output: OUTPUT0, type: FP32, shape: [50]
I0324 04:46:46.684620 1 http_server.cc:1068] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I0324 04:46:46.684628 1 http_server.cc:1088] HTTP using buffer for: 'OUTPUT0', size: 200, addr: 0x7f4f14002bf0
terminate called after throwing an instance of 'std::runtime_error'
  what():  Runtime error: invalid argument /repos/hugectr_inference_backend/src/hugectr.cc:2343

Signal (6) received.
 0# 0x000055ECA62F0299 in tritonserver
 1# 0x00007F568119D210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F5681553911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F568155F38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F568155F3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F568155F6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007F56760AFA9E in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 9# 0x00007F5681D3FF9A in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F5681D406B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F5681BD81A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
12# 0x00007F5681D3A527 in /opt/tritonserver/bin/../lib/libtritonserver.so
13# 0x00007F568158BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
14# 0x00007F5681A09609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
15# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Signal (11) received.
 0# 0x000055ECA62F0299 in tritonserver
 1# 0x00007F568119D210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00007F5681553911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 4# 0x00007F568155F38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F568155F3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F568155F6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F56760AFA9E in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 8# 0x00007F5681D3FF9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007F5681D406B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F5681BD81A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F5681D3A527 in /opt/tritonserver/bin/../lib/libtritonserver.so
12# 0x00007F568158BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
13# 0x00007F5681A09609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
14# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

[BUG] HugeCTR backend crashed when I make pressure test with 100 concurrents,and the coredump file not generated.

The HugeCTR backend crashed when I ran a pressure test with 100 concurrent requests, and no coredump file was generated.

I1215 03:27:54.113361 33740 hugectr.cc:1438] model h3_hugectr, instance h3_hugectr, executing 1 requests
I1215 03:27:54.113399 33740 hugectr.cc:1512] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I1215 03:27:54.113373 33740 http_server.cc:2727] HTTP request: 2 /v2/models/h3_hugectr/infer
I1215 03:27:54.113420 33740 hugectr.cc:1610]    input CATCOLUMN: datatype = INT64, shape = [1,576], byte_size = 4608, buffer_count = 1
I1215 03:27:54.113434 33740 model_repository_manager.cc:615] GetInferenceBackend() 'h3_hugectr' version -1
I1215 03:27:54.113466 33740 model_repository_manager.cc:615] GetInferenceBackend() 'h3_hugectr' version -1
I1215 03:27:54.113455 33740 hugectr.cc:1626]    input ROWINDEX: datatype = INT32, shape = [1,577], byte_size = 2308, buffer_count = 1
I1215 03:27:54.113478 33740 hugectr.cc:1640]    input DES: datatype = FP32, shape = [1,704], byte_size = 2816, buffer_count = 1
I1215 03:27:54.113484 33740 hugectr.cc:1665]    requested_output OUTPUT0
I1215 03:27:54.113495 33740 infer_response.cc:165] add response output: output: OUTPUT0, type: FP32, shape: [32]
I1215 03:27:54.113493 33740 infer_request.cc:524] prepared: [0x0x7f1f0800d230] request id: 1, model: h3_hugectr, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f1f0800aa88] input: ROWINDEX, type: INT32, original shape: [1,577], batch + shape: [1,577], shape: [577]
[0x0x7f1f0800edc8] input: CATCOLUMN, type: INT64, original shape: [1,576], batch + shape: [1,576], shape: [576]
[0x0x7f1f0800b7d8] input: DES, type: FP32, original shape: [1,704], batch + shape: [1,704], shape: [704]
override inputs:
inputs:
[0x0x7f1f0800b7d8] input: DES, type: FP32, original shape: [1,704], batch + shape: [1,704], shape: [704]
[0x0x7f1f0800edc8] input: CATCOLUMN, type: INT64, original shape: [1,576], batch + shape: [1,576], shape: [576]
[0x0x7f1f0800aa88] input: ROWINDEX, type: INT32, original shape: [1,577], batch + shape: [1,577], shape: [577]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I1215 03:27:54.113504 33740 http_server.cc:1051] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I1215 03:27:54.113543 33740 http_server.cc:1071] HTTP using buffer for: 'OUTPUT0', size: 128, addr: 0x7f1f44bca620
I1215 03:27:54.113581 33740 hugectr.cc:1438] model h3_hugectr, instance h3_hugectr, executing 1 requests
I1215 03:27:54.113599 33740 hugectr.cc:1512] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I1215 03:27:54.113615 33740 hugectr.cc:1610]    input CATCOLUMN: datatype = INT64, shape = [1,576], byte_size = 4608, buffer_count = 1
I1215 03:27:54.113629 33740 hugectr.cc:1626]    input ROWINDEX: datatype = INT32, shape = [1,577], byte_size = 2308, buffer_count = 1
I1215 03:27:54.113640 33740 hugectr.cc:1640]    input DES: datatype = FP32, shape = [1,704], byte_size = 2816, buffer_count = 1
I1215 03:27:54.113646 33740 hugectr.cc:1665]    requested_output OUTPUT0
I1215 03:27:54.113651 33740 infer_response.cc:165] add response output: output: OUTPUT0, type: FP32, shape: [32]
I1215 03:27:54.113663 33740 http_server.cc:1051] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I1215 03:27:54.113674 33740 http_server.cc:1071] HTTP using buffer for: 'OUTPUT0', size: 128, addr: 0x7f1fdcbc3b60
I1215 03:27:54.115386 33740 hugectr.cc:1815] *****Processing request on device***** 0 for model h3_hugectr
I1215 03:27:54.120122 33740 hugectr.cc:1815] *****Processing request on device***** 0 for model h3_hugectr
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Signal (6) received.
terminate called recursively
Signal (6) received.
 0# 0x0000555E676798A9 in tritonserver
 1# 0x00007F22576FF210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F2257AB5911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F2257AC138C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F2257AC13F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F2257AC16A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# std::__throw_system_error(int) in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007F2257AEE0BD in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
10# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
11# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
12# 0x00007F22383F404A in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
13# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
14# 0x00007F225828B83A in /opt/tritonserver/bin/../lib/libtritonserver.so
15# 0x00007F225828C04D in /opt/tritonserver/bin/../lib/libtritonserver.so
16# 0x00007F2258140801 in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007F2258285DC7 in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007F2257AEDDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
19# 0x00007F2257F6B609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
20# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Aborted (core dumped)
root@13609cdec7a0:/opt/tritonserver# ll
total 3612
drwxr-xr-x 1 root          root             4096 Nov 29 08:07  ./
drwxr-xr-x 1 root          root             4096 Oct 21 18:44  ../
-rw-r--r-- 1 root          root              641 Nov  4 19:20 '=1.21.6'
-rw-rw-r-- 1 triton-server triton-server    1485 Oct 21 17:20  LICENSE
-rw-rw-r-- 1 triton-server triton-server 3012640 Oct 21 17:20  NVIDIA_Deep_Learning_Container_License.pdf
-rw-rw-r-- 1 triton-server triton-server       7 Oct 21 17:20  TRITON_VERSION
drwxr-xr-x 1 triton-server triton-server    4096 Nov  4 19:33  backends/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  bin/
drwxr-xr-x 1 triton-server triton-server    4096 Oct 21 18:45  include/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  lib/
-rw------- 1 root          root           616982 Dec 10 02:29  nohup.out
-rwxrwxr-x 1 triton-server triton-server    4090 Oct 21 17:20  nvidia_entrypoint.sh*
drwxr-xr-x 3 triton-server triton-server    4096 Oct 21 18:48  repoagents/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  third-party-src/
root@13609cdec7a0:/opt/tritonserver# ll /opt/tritonserver/backends/hugectr/
total 400
drwxr-xr-x 2 root root   4096 Nov  4 19:33 ./
drwxr-xr-x 3 root root   4096 Nov  4 19:33 ../
-rw-r--r-- 1 root root 398904 Nov  4 19:33 libtriton_hugectr.so


my start command:

 tritonserver --model-repository=/data/gux/h3/h3_hugectr/ --backend-config=hugectr,ps=/data/gux/h3/h3_hugectr/ps.json --load-model h3_hugectr --model-control-mode=explicit --log-verbose 1 

#pressure test  command by siege

$ siege -H "Content-Type:application/json" "http://127.0.0.1:8088/xxxx/v1/api  POST < /home/t3mgr/data.json" -b -c 100  -t 5M

I1215 03:44:22.466890 66071 hugectr.cc:748] The model configuration:
{
"name": "h3_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 4096,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "h3_hugectr_0",
"kind": "KIND_GPU",
"count": 3,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"cat_feature_num": {
"string_value": "18"
},
"config": {
"string_value": "/data/gux/h3/1/dcn.json"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "1"
},
"embedding_vector_size": {
"string_value": "128"
},
"gpucacheper": {
"string_value": "0.5"
},
"des_feature_num": {
"string_value": "22"
},
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "18"
}
},
"model_warmup": []
}

Unable to launch Triton Server with hps backend using latest HugeCTR and hugectr_backend repos

Description
I'm unable to install and run the Triton server using the HPS backend.

Triton Information
Triton v23.06

To Reproduce
Steps to reproduce the behavior.

I'm following steps (1) and (2) here (https://github.com/triton-inference-server/hugectr_backend) under the Build the HPS Backend from Scratch section. I follow all the steps exactly.

I'm doing all the steps in a container built from this image (https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/dockerfile.ctr).

After trying to launch Triton using tritonserver --model-repository=/opt/hugectr_testing/data/test_dask/output/model_inference --backend-config=hps,ps=/opt/hugectr_testing/data/test_dask/output/model_inference/ps.json, I get the following:

I1120 07:09:59.322961 19420 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f4fd6000000' with size 268435456
I1120 07:09:59.323460 19420 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1120 07:09:59.333655 19420 model_lifecycle.cc:462] loading: criteo:1
I1120 07:09:59.333798 19420 model_lifecycle.cc:462] loading: criteo_nvt:1
I1120 07:09:59.370486 19420 hps.cc:62] TRITONBACKEND_Initialize: hps
I1120 07:09:59.370512 19420 hps.cc:69] Triton TRITONBACKEND API version: 1.13
I1120 07:09:59.370519 19420 hps.cc:73] 'hps' TRITONBACKEND API version: 1.15
I1120 07:09:59.370536 19420 hps.cc:150] TRITONBACKEND_Backend Finalize: HPSBackend
E1120 07:09:59.370572 19420 model_lifecycle.cc:626] failed to load 'criteo' version 1: Unsupported: Triton backend API version does not support this backend
I1120 07:09:59.370602 19420 model_lifecycle.cc:753] failed to load 'criteo'
I1120 07:09:59.521670 19436 pb_stub.cc:255]  Failed to initialize Python stub for auto-complete: ModuleNotFoundError: No module named 'hugectr'

At:
  /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load

E1120 07:09:59.529317 19420 model_lifecycle.cc:626] failed to load 'criteo_nvt' version 1: Internal: ModuleNotFoundError: No module named 'hugectr'

At:
  /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load

I1120 07:09:59.529369 19420 model_lifecycle.cc:753] failed to load 'criteo_nvt'
E1120 07:09:59.529473 19420 model_repository_manager.cc:562] Invalid argument: ensemble 'criteo_ens' depends on 'criteo' which has no loaded version. Model 'criteo' loading failed with error: version 1 is at UNAVAILABLE state: Unsupported: Triton backend API version does not support this backend;
I1120 07:09:59.529552 19420 server.cc:603] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1120 07:09:59.529642 19420 server.cc:630] 
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                                        |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1120 07:09:59.529735 19420 server.cc:673] 
+------------+---------+-------------------------------------------------------------------------------------------------+
| Model      | Version | Status                                                                                          |
+------------+---------+-------------------------------------------------------------------------------------------------+
| criteo     | 1       | UNAVAILABLE: Unsupported: Triton backend API version does not support this backend              |
| criteo_nvt | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'hugectr'                           |
|            |         |                                                                                                 |
|            |         | At:                                                                                             |
|            |         |   /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module> |
|            |         |   <frozen importlib._bootstrap>(241): _call_with_frames_removed                                 |
|            |         |   <frozen importlib._bootstrap_external>(883): exec_module                                      |
|            |         |   <frozen importlib._bootstrap>(703): _load_unlocked                                            |
|            |         |   <frozen importlib._bootstrap>(1006): _find_and_load_unlocked                                  |
|            |         |   <frozen importlib._bootstrap>(1027): _find_and_load                                           |
+------------+---------+-------------------------------------------------------------------------------------------------+

I1120 07:09:59.575939 19420 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I1120 07:09:59.576319 19420 metrics.cc:701] Collecting CPU metrics
I1120 07:09:59.576525 19420 tritonserver.cc:2385] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                          |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                         |
| server_version                   | 2.35.0                                                                                                                                                                                                         |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace loggin |
|                                  | g                                                                                                                                                                                                              |
| model_repository_path[0]         | /opt/hugectr_testing/data/test_dask/output/model_inference                                                                                                                                                     |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                      |
| strict_model_config              | 0                                                                                                                                                                                                              |
| rate_limit                       | OFF                                                                                                                                                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                      |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                       |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                            |
| strict_readiness                 | 1                                                                                                                                                                                                              |
| exit_timeout                     | 30                                                                                                                                                                                                             |
| cache_enabled                    | 0                                                                                                                                                                                                              |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1120 07:09:59.576581 19420 server.cc:304] Waiting for in-flight requests to complete.
I1120 07:09:59.576592 19420 server.cc:320] Timeout 30: Found 0 model versions that have in-flight inferences
I1120 07:09:59.576617 19420 server.cc:335] All models are stopped, unloading models
I1120 07:09:59.576634 19420 server.cc:342] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

I trained my model using this example notebook: https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb

However, this notebook came out before the HugeCTR backend was merged with the HPS backend. As a result, I needed to manually change a line in my config.pbtxt to go from the hugectr to the hps backend => backend: "hugectr" to backend: "hps".

Expected behavior
First, when building the inference version, I expect hugectr's Python module to be installed, but it isn't. This is strange because when I turn off -DENABLE_INFERENCE=ON and install, import hugectr works.

Second, I expect the Triton server to start and accept requests.

[BUG] HugeCTR Backend server dies with SIGBUS during server startup (22.04)

I was running the server on merlin-inference:22.04 with the same model and settings as #40, which was able to receive queries, but it died with SIGBUS.

  • run.sh
docker run --rm --runtime=nvidia --net=host -e HUGECTR_LOG_LEVEL=0 -it \
    -v `pwd`:/models \
    nvcr.io/nvidia/merlin/merlin-inference:22.04 \
        tritonserver \
        --model-repository=/models/ \
        --load-model=meb \
        --model-control-mode=explicit \
        --backend-directory=/usr/local/hugectr/backends \
        --backend-config=hugectr,ps=/models/meb/ps.json \
        --log-info=true \
        --log-verbose=0
$ ./run.sh                                                                                                                                            [525/538]
Unable to find image 'nvcr.io/nvidia/merlin/merlin-inference:22.04' locally
22.04: Pulling from nvidia/merlin/merlin-inference
4d32b49e2995: Already exists
45893188359a: Pulling fs layer
5ad1f2004580: Pulling fs layer
6ddc1d0f9183: Pulling fs layer
4cc43a803109: Waiting
e94a4481e933: Waiting
3e7e4c9bc2b1: Waiting
9463aa3f5627: Waiting
a4a0c690bc7d: Waiting
59d451175f69: Waiting
eaf45e9f32d1: Waiting
d8d16d6af76d: Waiting
9e04bda98b05: Waiting
4f4fb700ef54: Pull complete
98e1b8b4cf4b: Pull complete
3ba4cd25cab4: Pull complete
e07a05c28244: Pull complete
6a99482f27f4: Pull complete
0a9c87e68332: Pull complete
6d909763dff3: Pull complete
7f01a1b77738: Pull complete
c70caad572e6: Pull complete
c0b57c72d7c7: Pull complete
3b7c493bb8f8: Pull complete
70f21191d5fa: Pull complete
b72ef49a1648: Pull complete
1735193fce1a: Pull complete
6f0a31eb4fc9: Pull complete
5a83b81d8cfd: Pull complete
24c069e055bb: Pull complete
9c90284fcd0f: Pull complete
405c3b74edb7: Pull complete
2c2cfec47605: Pull complete
f9e5bf6b037e: Pull complete
69b1183a0dc9: Pull complete
73133bf37ddc: Pull complete
187e35d56f89: Pull complete
23ec4ade6dcd: Pull complete
4fba3dd7f97c: Pull complete
11923c954056: Pull complete
95b67db4aa6d: Pull complete
73d16c81d9c9: Pull complete
f0a024c8b08f: Pull complete
099d0dd31169: Pull complete
96d82345047b: Pull complete
188b63e153b6: Pull complete
de97abb09153: Pull complete
01be5700f44b: Pull complete
9f01a696bb8b: Pull complete
3e3d4a57ff34: Pull complete
ccb0ee9eb079: Pull complete
a496569779c9: Pull complete
e9c89d74ffd4: Pull complete
e225aafaa730: Pull complete
ef7b62a5bf12: Pull complete
43449b45a07c: Pull complete
09fcbe8e254c: Pull complete
b85ef4e24a81: Pull complete
Digest: sha256:eeb55e2463291d83b8ad3d05f63a8be641dd1684daa13cafc2ea044546130fd5
Status: Downloaded newer image for nvcr.io/nvidia/merlin/merlin-inference:22.04

==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 22.03 (build 33743047)

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.73.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0509 02:01:17.740583 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f7998000000' with size 268435456
I0509 02:01:17.750397 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0509 02:01:17.750411 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0509 02:01:17.750417 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0509 02:01:17.750422 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0509 02:01:17.750428 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 4 with size 67108864
I0509 02:01:17.750434 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 5 with size 67108864
I0509 02:01:17.750439 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 6 with size 67108864
I0509 02:01:17.750446 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 7 with size 67108864
I0509 02:01:18.869713 1 model_repository_manager.cc:997] loading: meb:1
I0509 02:01:18.999772 1 hugectr.cc:1597] TRITONBACKEND_Initialize: hugectr
I0509 02:01:18.999796 1 hugectr.cc:1604] Triton TRITONBACKEND API version: 1.8
I0509 02:01:18.999803 1 hugectr.cc:1608] 'hugectr' TRITONBACKEND API version: 1.8
I0509 02:01:18.999810 1 hugectr.cc:1631] The HugeCTR backend Repository location: /usr/local/hugectr/backends/hugectr
I0509 02:01:18.999816 1 hugectr.cc:1640] The HugeCTR backend configuration: {"cmdline":{"ps":"/models/meb/ps.json"}}
I0509 02:01:18.999842 1 hugectr.cc:344] *****Parsing Parameter Server Configuration from /models/meb/ps.json
I0509 02:01:18.999897 1 hugectr.cc:365] Support 64-bit keys = 1
I0509 02:01:18.999912 1 hugectr.cc:376] Volatile database -> type = parallel_hash_map
I0509 02:01:18.999918 1 hugectr.cc:381] Volatile database -> address = 127.0.0.1:7000
I0509 02:01:18.999924 1 hugectr.cc:386] Volatile database -> user name = default
I0509 02:01:18.999932 1 hugectr.cc:390] Volatile database -> password = <empty>
I0509 02:01:18.999938 1 hugectr.cc:397] Volatile database -> algorithm = phm
I0509 02:01:18.999945 1 hugectr.cc:402] Volatile database -> number of partitions = 16
I0509 02:01:18.999951 1 hugectr.cc:408] Volatile database -> max. batch size (GET) = 100000
I0509 02:01:18.999958 1 hugectr.cc:415] Volatile database -> max. batch size (SET) = 100000
I0509 02:01:18.999964 1 hugectr.cc:423] Volatile database -> refresh time after fetch = 0
I0509 02:01:18.999972 1 hugectr.cc:430] Volatile database -> overflow margin = 120000000
I0509 02:01:18.999979 1 hugectr.cc:436] Volatile database -> overflow policy = evict_oldest
I0509 02:01:18.999999 1 hugectr.cc:442] Volatile database -> overflow resolution target = 0.8
I0509 02:01:19.000007 1 hugectr.cc:450] Volatile database -> initial cache rate = 1
I0509 02:01:19.000014 1 hugectr.cc:456] Volatile database -> cache missed embeddings = 0
I0509 02:01:19.000021 1 hugectr.cc:466] Volatile database -> update filters = []
I0509 02:01:19.000060 1 hugectr.cc:583] Model name = meb
I0509 02:01:19.000067 1 hugectr.cc:592] Model 'meb' -> network file = /models/meb/1/meb.json
I0509 02:01:19.000074 1 hugectr.cc:599] Model 'meb' -> max. batch size = 100
I0509 02:01:19.000081 1 hugectr.cc:605] Model 'meb' -> dense model file = /models/meb/1/meb_dense_0.model
I0509 02:01:19.000089 1 hugectr.cc:611] Model 'meb' -> sparse model files = [/models/meb/1/meb0_sparse_0.model]
I0509 02:01:19.000097 1 hugectr.cc:622] Model 'meb' -> use GPU embedding cache = 1                                                                                                    [406/538]
I0509 02:01:19.000106 1 hugectr.cc:631] Model 'meb' -> hit rate threshold = 0.8
I0509 02:01:19.000115 1 hugectr.cc:639] Model 'meb' -> per model GPU cache = 1
I0509 02:01:19.000132 1 hugectr.cc:655] Model 'meb' -> num. pool worker buffers = 4
I0509 02:01:19.000142 1 hugectr.cc:662] Model 'meb' -> num. pool refresh buffers = 1
I0509 02:01:19.000149 1 hugectr.cc:669] Model 'meb' -> cache refresh rate per iteration = 0.2
I0509 02:01:19.000158 1 hugectr.cc:678] Model 'meb' -> deployed device list = [0, 1, 2, 3, 4, 5, 6, 7]
I0509 02:01:19.000167 1 hugectr.cc:686] Model 'meb' -> default value for each table = [0]
I0509 02:01:19.000179 1 hugectr.cc:706] *****The HugeCTR Backend Parameter Server is creating... *****
I0509 02:01:19.000351 1 hugectr.cc:714] ***** Parameter Server(Int64) is creating... *****
I0509 02:03:04.076790 1 hugectr.cc:725] *****The HugeCTR Backend Backend created the Parameter Server successfully! *****
I0509 02:03:04.076884 1 hugectr.cc:1703] TRITONBACKEND_ModelInitialize: meb (version 1)
I0509 02:03:04.076891 1 hugectr.cc:1716] Repository location: /models/meb
I0509 02:03:04.076899 1 hugectr.cc:1731] backend configuration in mode: {"cmdline":{"ps":"/models/meb/ps.json"}}
I0509 02:03:04.078340 1 hugectr.cc:974] Verifying model configuration: {
    "name": "meb",
    "platform": "",
    "backend": "hugectr",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 100,
    "input": [
        {
            "name": "DES",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "CATCOLUMN",
            "data_type": "TYPE_INT64",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "ROWINDEX",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT0",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "meb_0",
            "kind": "KIND_GPU",
            "count": 4,
            "gpus": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "max_nnz": {
            "string_value": "8000"
        },
        "embedding_vector_size": {
            "string_value": "15"
        },
        "gpucacheper": {
            "string_value": "1.0"
        },
        "des_feature_num": {
            "string_value": "0"
        },
        "hit_rate_threshold": {
            "string_value": "0.8"
        },
        "gpucache": {
            "string_value": "true"
        },
        "embeddingkey_long_type": {
            "string_value": "true"
        },
        "slots": {
            "string_value": "15"
        },
        "config": {
            "string_value": "/models/meb/1/meb.json"
        },
        "cat_feature_num": {
            "string_value": "8000"
        },
        "label_dim": {
            "string_value": "1"
        }
    },
    "model_warmup": []
}
I0509 02:03:04.078466 1 hugectr.cc:1060] The model configuration: {
    "name": "meb",
    "platform": "",
    "backend": "hugectr",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 100,
    "input": [
        {
            "name": "DES",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "CATCOLUMN",
            "data_type": "TYPE_INT64",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "ROWINDEX",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT0",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "meb_0",
            "kind": "KIND_GPU",
            "count": 4,
            "gpus": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "max_nnz": {
            "string_value": "8000"
        },
        "embedding_vector_size": {
            "string_value": "15"
        },
        "gpucacheper": {
            "string_value": "1.0"
        },
        "des_feature_num": {
            "string_value": "0"
        },
        "hit_rate_threshold": {
            "string_value": "0.8"
        },
        "gpucache": {
            "string_value": "true"
        },
        "embeddingkey_long_type": {
            "string_value": "true"
        },
        "slots": {
            "string_value": "15"
        },
        "config": {
            "string_value": "/models/meb/1/meb.json"
        },
        "cat_feature_num": {
            "string_value": "8000"
        },
        "label_dim": {
            "string_value": "1"
        }
    },
    "model_warmup": []
}
I0509 02:03:04.078548 1 hugectr.cc:1105] slots set = 15
I0509 02:03:04.078555 1 hugectr.cc:1111] desene number = 0
I0509 02:03:04.078562 1 hugectr.cc:1117] cat_feature number = 8000
I0509 02:03:04.078569 1 hugectr.cc:1129] embedding size = 15
I0509 02:03:04.078576 1 hugectr.cc:1135] maxnnz = 8000
I0509 02:03:04.078583 1 hugectr.cc:1153] HugeCTR model config path = /models/meb/1/meb.json
I0509 02:03:04.078593 1 hugectr.cc:1176] support gpu cache = 1
I0509 02:03:04.078610 1 hugectr.cc:1199] gpu cache per = 1
I0509 02:03:04.078619 1 hugectr.cc:1216] hit-rate threshold = 0.8
I0509 02:03:04.078626 1 hugectr.cc:1232] Label dim = 1
I0509 02:03:04.078633 1 hugectr.cc:1238] support 64-bit embedding key = 1
I0509 02:03:04.078639 1 hugectr.cc:1252] Model_Inference_Para.max_batchsize: 100
I0509 02:03:04.078645 1 hugectr.cc:1256] max_batch_size in model config.pbtxt is 100
I0509 02:03:04.078654 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 0
I0509 02:03:04.078661 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 1
I0509 02:03:04.078667 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 2
I0509 02:03:04.078673 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 3
I0509 02:03:04.078680 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 4
I0509 02:03:04.078686 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 5
I0509 02:03:04.078693 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 6
I0509 02:03:04.078699 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 7
I0509 02:03:04.078705 1 hugectr.cc:1353] ******Creating Embedding Cache for model meb successfully
I0509 02:03:04.084639 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 0)
I0509 02:03:04.084655 1 hugectr.cc:1495] Triton Model Instance Initialization on device 0
I0509 02:03:04.092570 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:04.092584 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:04.119280 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:04.119445 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:04.119527 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:04.119535 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:06.406007 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:06.406216 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 1)
I0509 02:03:06.406245 1 hugectr.cc:1495] Triton Model Instance Initialization on device 1
I0509 02:03:06.406305 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:06.406313 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:06.423536 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:06.423681 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:06.423781 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:06.423789 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:07.636730 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:07.636912 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 2)
I0509 02:03:07.636927 1 hugectr.cc:1495] Triton Model Instance Initialization on device 2
I0509 02:03:07.636933 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:07.636941 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:07.653589 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:07.653740 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:07.653842 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:07.653850 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:08.864003 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:08.864193 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 3)
I0509 02:03:08.864208 1 hugectr.cc:1495] Triton Model Instance Initialization on device 3
I0509 02:03:08.864215 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:08.864239 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:08.881026 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:08.881182 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:08.881300 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:08.881309 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:10.089367 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:10.089537 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 4)
I0509 02:03:10.089549 1 hugectr.cc:1495] Triton Model Instance Initialization on device 4
I0509 02:03:10.089556 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:10.089563 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:10.106748 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:10.106901 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:10.107012 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:10.107020 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:11.315555 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:11.315748 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 5)
I0509 02:03:11.315763 1 hugectr.cc:1495] Triton Model Instance Initialization on device 5
I0509 02:03:11.315770 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:11.315776 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:11.333771 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:11.333941 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:11.334056 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:11.334065 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:12.548906 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:12.549089 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 6)
I0509 02:03:12.549103 1 hugectr.cc:1495] Triton Model Instance Initialization on device 6
I0509 02:03:12.549110 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:12.549116 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:12.566637 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:12.566807 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:12.566930 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:12.566939 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:13.789554 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:13.789746 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 7)
I0509 02:03:13.789761 1 hugectr.cc:1495] Triton Model Instance Initialization on device 7
I0509 02:03:13.789768 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:13.789775 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:13.807076 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:13.807262 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:13.807390 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:13.807399 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.054373 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.054532 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 0)
I0509 02:03:15.054545 1 hugectr.cc:1495] Triton Model Instance Initialization on device 0
I0509 02:03:15.054551 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.054558 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.059515 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.059646 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.059775 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.059784 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.435527 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.435710 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 1)
I0509 02:03:15.435724 1 hugectr.cc:1495] Triton Model Instance Initialization on device 1
I0509 02:03:15.435733 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.435742 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.440908 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.441018 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.441116 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.441124 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.809277 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.809436 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 2)
I0509 02:03:15.809449 1 hugectr.cc:1495] Triton Model Instance Initialization on device 2
I0509 02:03:15.809455 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.809462 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.814312 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.814412 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.814511 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.814519 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.182679 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.182843 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 3)
I0509 02:03:16.182857 1 hugectr.cc:1495] Triton Model Instance Initialization on device 3
I0509 02:03:16.182863 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.182870 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.188940 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.189040 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.189140 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.189148 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.567086 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.567323 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 4)
I0509 02:03:16.567338 1 hugectr.cc:1495] Triton Model Instance Initialization on device 4
I0509 02:03:16.567344 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.567350 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.572950 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.573061 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.573169 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.573177 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.954819 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.954976 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 5)
I0509 02:03:16.954989 1 hugectr.cc:1495] Triton Model Instance Initialization on device 5
I0509 02:03:16.954996 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.955001 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.960033 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.960144 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.960262 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.960270 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:17.335444 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:17.335611 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 6)
I0509 02:03:17.335624 1 hugectr.cc:1495] Triton Model Instance Initialization on device 6
I0509 02:03:17.335631 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:17.335637 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:17.340769 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:17.340877 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:17.340976 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:17.340983 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:17.725151 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:17.725387 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 7)
I0509 02:03:17.725400 1 hugectr.cc:1495] Triton Model Instance Initialization on device 7
I0509 02:03:17.725407 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:17.725412 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:17.730393 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:17.730501 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:17.730597 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:17.730606 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
[qgpuvb813-cvision:1    :0:381] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    381) ====
 0 0x00000000000143c0 __funlockfile()  ???:0
 1 0x000000000018ba51 __nss_database_lookup()  ???:0
 2 0x0000000000068d6c ncclGroupEnd()  ???:0
 3 0x000000000005de2d ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f163 clone()  ???:0
=================================

Inconsistency between HugeCTR v3.3.1 and hugectr_backend v3.3.1

Hi Folks,

I just tried building the latest tags from scratch and hugectr_backend fails with three issues:

  1. Use of <filesystem> requires C++17 (cxx_std_17).
  2. It looks like hugectr_backend anticipates the need for {db_type, redis_ip, rocksdb_path, cache_size_percentage_redis} to be kept in HugeCTR::InferenceParams, but they are not used anywhere at present; either add the members or hold off on the assignments until they are needed.
  3. Warnings are treated as fatal errors, and the variable float cache_size_percentage_redis = 0.1 is assigned but never used.

The following patch works around all three issues:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 215b520..479dd2d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -99,7 +99,7 @@ target_include_directories(
     ${CMAKE_CURRENT_SOURCE_DIR}/src
 )

-target_compile_features(triton-hugectr-backend PRIVATE cxx_std_11)
+target_compile_features(triton-hugectr-backend PRIVATE cxx_std_17)
 target_compile_options(
   triton-hugectr-backend PRIVATE
   $<$<OR:$<CXX_COMPILER_ID:Clang>,$<CXX_COMPILER_ID:AppleClang>,$<CXX_COMPILER_ID:GNU>>:
diff --git a/src/hugectr.cc b/src/hugectr.cc
index 2979c85..629259a 100644
--- a/src/hugectr.cc
+++ b/src/hugectr.cc
@@ -298,7 +298,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
   LOG_MESSAGE(TRITONSERVER_LOG_INFO,(std::string("The depolyment Data base type is: ") + db_type).c_str());

   float cache_size_percentage_redis=0.1;
-  std::string cpu_cache_per;
+  std::string cpu_cache_per = std::to_string(cache_size_percentage_redis);
   parameter_server_config.MemberAsString("cache_size_percentage_redis", &cpu_cache_per);
   cache_size_percentage_redis=std::atof(cpu_cache_per.c_str());
   LOG_MESSAGE(TRITONSERVER_LOG_INFO,(std::string("The depolyment cache_size_percentage_redis is: ") + cpu_cache_per).c_str());
@@ -340,6 +340,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
     }

     HugeCTR::InferenceParams infer_param(modelname, 64, 0.55, dense, sparses, 0, true, 0.55, support_int64_key_);
+    /*
     if(db_type== "local"){
       infer_param.db_type=HugeCTR::DATABASE_TYPE::LOCAL;
     }
@@ -353,6 +354,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
     infer_param.redis_ip =redis_ip;
     infer_param.rocksdb_path = rocksdb_path;
     infer_param.cache_size_percentage_redis = cache_size_percentage_redis;
+     */
     inference_params_map.insert(std::pair<std::string, HugeCTR::InferenceParams>(modelname, infer_param));
   }
   return nullptr;

Does the max_batch_size configuration not work for batched inference?

I changed "max_batch_size" in ps.json and /model/wdl/config.pbtxt, trying values of 64, 256, 1024, and 4096, and ran a load test with 100 HTTP clients. Checking the logs, the inference batch size did not change, and the response time was also similar across the different max_batch_size settings.

Is the max_batch_size setting honored by hugectr_backend? Any suggestions about the value and function of max_batch_size?
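For reference, max_batch_size is set in two places for a HugeCTR model: per model in ps.json and at the top level of the model's config.pbtxt. Below is a minimal sketch of both, using the wdl model from this report and 1024 purely as an illustrative value (all other fields omitted):

# Excerpt of /model/wdl/config.pbtxt (Triton protobuf text format)
name: "wdl"
backend: "hugectr"
max_batch_size: 1024

# Excerpt of the matching model entry in /model/ps.json
{
    "models": [
        {
            "model": "wdl",
            "max_batch_size": 1024
        }
    ]
}

The backend logs the effective value at startup (see the "max_batch_size in model config.pbtxt is 100" line earlier on this page), so it is worth checking that this line reflects the new value. Also note that the observed batch size depends on what the clients actually send: 100 independent HTTP clients each issuing single requests may never form larger batches on their own unless some form of server-side batching takes effect, which could explain the similar response times.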

How to update the model weights on the Triton server after continuous training?

Hello. We are testing continuous training.

We followed this example: https://github.com/triton-inference-server/hugectr_backend/blob/v3.5/samples/hierarchical_deployment/hps_e2e_demo/Continuous_Training.ipynb

We tested this sample with our own simple dataset. First we ran training with Kafka and configured the Kafka settings in the Triton ps.json; that part works well.

  • HugeCTR training log
    (screenshot omitted)

  • Triton log
    (screenshot omitted)

But we cannot find out how to update the dense model weights on the Triton server after the embeddings have been updated.

We found another approach: the load API. But the load API reloads the model weights and the embeddings together; in that case we would not need Kafka at all and could simply update the model and the embeddings directly. However, we cannot find an example of how to use the load API.

(screenshot omitted)

So we see two scenarios for updating embeddings on Triton:

  1. Continuous training with Kafka and Triton, which only updates the embeddings. (We want to know how to update the dense model weights after that, or whether freeze_dense is the answer.)

  2. Continuous training without Kafka, using the Triton load API to reload the dense model weights and update the embeddings (see the sketch below).
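For scenario 2, here is a minimal sketch of driving the Triton model-control load API from Python. It assumes the server was started with --model-control-mode=explicit, that the updated model files have already been copied into the model repository, and it uses "wdl" purely as a placeholder model name; none of this is confirmed by the thread.

import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (assumed to be localhost:8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Ask Triton to (re)load the model after the updated files are in place.
# This requires the server to be running with --model-control-mode=explicit.
client.load_model(model_name="wdl")

# Optionally confirm that the reloaded model is ready to serve again.
assert client.is_model_ready(model_name="wdl")

Whether such a reload lets the HPS backend pick up only the dense weights, or always re-reads the sparse model files as well, is exactly the open question in this issue.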

HugeCTR memory pool is empty

Hello. I tested the HugeCTR backend again, this time using perf_analyzer.

I am using merlin-inference:22.03 on a V100. The HugeCTR backend uses 28 GB of GPU memory, so about 4 GB of GPU memory remains.
(screenshot omitted)

When I test with perf_analyzer at concurrency 17, throughput drops rapidly and the HugeCTR log reports "memory pool is empty". How can I fix this?
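For reference, a sweep like the one below can be produced with a perf_analyzer invocation along these lines (a sketch; the model name meb is taken from the logs above, while the server URL and the input-data file are assumptions):

# Sweep client concurrency from 1 to 29 in steps of 4 against the deployed model.
# --input-data points to a JSON file with representative DES/CATCOLUMN/ROWINDEX inputs.
perf_analyzer -m meb -u localhost:8000 \
    --concurrency-range 1:29:4 \
    --input-data perf_data.json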

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1906.57 infer/sec, latency 520 usec
Concurrency: 5, throughput: 7677.62 infer/sec, latency 647 usec
Concurrency: 9, throughput: 11505.2 infer/sec, latency 778 usec
Concurrency: 13, throughput: 12422.1 infer/sec, latency 1042 usec
Concurrency: 17, throughput: 3247.32 infer/sec, latency 5232 usec
Concurrency: 21, throughput: 764.767 infer/sec, latency 27540 usec
Concurrency: 25, throughput: 240.7 infer/sec, latency 103849 usec
Concurrency: 29, throughput: 217.05 infer/sec, latency 133800 usec
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][INFO][RK0][EC insert #10]: *****Insert embedding cache of model meb on device 4*****
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty

Also, I want to change the HugeCTR log level. I already pass the tritonserver options --log-verbose=0 and --log-info=false, but HugeCTR still prints info-level logs. How can I change the HugeCTR log option?

[BUG] Loading the criteo model fails in Triton inference

Describe the bug
Loading the criteo model fails in Triton inference.

Steps/Code to reproduce bug

  1. tritonserver --model-repository=/model/ --backend-config=hugectr,ps=/model/ps.json --model-control-mode=explicit
  2. triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)
  3. triton_client.load_model(model_name="criteo") (a self-contained version of steps 2 and 3 follows below)
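Steps 2 and 3 as a single self-contained snippet (a sketch; it assumes the tritonclient Python package is installed and imports it under the alias used above):

import tritonclient.http as tritonhttpclient

# Connect to the Triton HTTP endpoint started in step 1.
triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)

# Explicitly load the model; the server is running with --model-control-mode=explicit.
triton_client.load_model(model_name="criteo")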

Expected behavior
The criteo model loads successfully.

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of NVTabular install: Docker
    • docker pull nvcr.io/nvidia/merlin/merlin-inference:21.11
    • docker run -itd --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /data:/data nvcr.io/nvidia/merlin/merlin-inference:21.11
    • docker exec -it xxx /bin/bash
    • tritonserver --model-repository=/model/ --backend-config=hugectr,ps=/model/ps.json --model-control-mode=explicit

Additional context
I1123 10:31:24.977723 421 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I1123 10:31:24.977975 421 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I1123 10:31:25.019065 421 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
I1123 10:31:53.109162 421 model_repository_manager.cc:1022] loading: criteo:1
I1123 10:31:53.227686 421 hugectr.cc:1140] TRITONBACKEND_Initialize: hugectr
I1123 10:31:53.227720 421 hugectr.cc:1150] Triton TRITONBACKEND API version: 1.6
I1123 10:31:53.227730 421 hugectr.cc:1156] 'hugectr' TRITONBACKEND API version: 1.6
I1123 10:31:53.227735 421 hugectr.cc:1181] The HugeCTR backend Repository location: /opt/tritonserver/backends/hugectr
I1123 10:31:53.227740 421 hugectr.cc:1191] The HugeCTR backend configuration:
{"cmdline":{"ps":"/model/ps.json"}}
I1123 10:31:53.227778 421 hugectr.cc:311] *****Parsing Parameter Server Configuration from /model/ps.json
I1123 10:31:53.227801 421 hugectr.cc:325] Enable support for Int64 embedding key: 0
I1123 10:31:53.227811 421 hugectr.cc:329] The depolyment Data base type is: local
I1123 10:31:53.227820 421 hugectr.cc:335] The depolyment cache_size_percentage_redis is:
I1123 10:31:53.227830 421 hugectr.cc:339] Redis ip is: 127.0.0.1:7000
I1123 10:31:53.227837 421 hugectr.cc:343] Local RocksDB path is:
I1123 10:31:53.227848 421 hugectr.cc:433] The HugeCTR Backend Parameter Server is creating...
I1123 10:31:53.227857 421 hugectr.cc:446] The HugeCTR Backend Backend Parameter Server(Int32) is creating...
Signal (11) received.
0# 0x00005622F47A58A9 in tritonserver
1# 0x00007F02F22BD210 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# HugeCTR::parameter_server::parameter_server(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&, std::vector<HugeCTR::InferenceParams, std::allocatorHugeCTR::InferenceParams >&) in /usr/local/hugectr/lib/libhugectr_inference.so
3# HugeCTR::HugectrUtility::Create_Parameter_Server(HugeCTR::INFER_TYPE, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&, std::vector<HugeCTR::InferenceParams, std::allocatorHugeCTR::InferenceParams >&) in /usr/local/hugectr/lib/libhugectr_inference.so
4# 0x00007F02D1643F52 in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
5# TRITONBACKEND_Initialize in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
6# 0x00007F02F2E34F7B in /opt/tritonserver/bin/../lib/libtritonserver.so
7# 0x00007F02F2E369FB in /opt/tritonserver/bin/../lib/libtritonserver.so
8# 0x00007F02F2E3F000 in /opt/tritonserver/bin/../lib/libtritonserver.so
9# 0x00007F02F2CE59BA in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F02F2CF37B1 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F02F26ABDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F02F2B29609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Segmentation fault (core dumped)

How to turn off only the request log? (The request log is too verbose, resulting in poor throughput)

The more detailed the server startup log, the better it prevents mistakes and helps debugging.
However, when the per-request log becomes too verbose, server throughput suffers.
In fact, turning on the info log alone cuts server throughput in half for the same 40-thread request load.

How can I make the server quiet by turning off only the request log while leaving the info log on at startup?
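For context, the two runs compared below differ only in the logging flags passed to tritonserver. A sketch of the corresponding launch commands, with the model repository path and backend configuration assumed from earlier examples on this page:

# Quiet run: info logging disabled.
tritonserver --model-repository=/models --backend-config=hugectr,ps=/models/ps.json \
    --log-info=false --log-verbose=0

# Verbose run: info logging enabled, which (as observed here) also emits per-request log lines.
tritonserver --model-repository=/models --backend-config=hugectr,ps=/models/ps.json \
    --log-info=true --log-verbose=0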

Below is the client-side benchmark result for the same request.

--log-info=false

This leaves no lines in the server log.

[Q0000041000][T12] 2282.60QPS avg.etl/infer/total:  3.87ms  10.00ms  15.20ms top.etl/infer/total:  9.00ms  15.00ms  22.00ms
[Q0000042000][T10] 2281.58QPS avg.etl/infer/total:  3.69ms  11.43ms  16.29ms top.etl/infer/total:  8.00ms  17.00ms  25.00ms
[Q0000043000][T11] 2282.70QPS avg.etl/infer/total:  4.13ms  11.52ms  16.87ms top.etl/infer/total: 10.00ms  21.00ms  28.00ms
[Q0000044000][T14] 2281.04QPS avg.etl/infer/total:  4.10ms  10.98ms  16.33ms top.etl/infer/total:  9.00ms  18.00ms  26.00ms
[Q0000045000][T33] 2280.70QPS avg.etl/infer/total:  3.43ms  11.81ms  16.51ms top.etl/infer/total:  8.00ms  19.00ms  24.00ms
[Q0000046000][T03] 2279.96QPS avg.etl/infer/total:  4.07ms   9.99ms  15.36ms top.etl/infer/total: 10.00ms  16.00ms  23.00ms

--log-info=true

This option makes the server request log too verbose, cutting throughput in half.

[Q0000041000][T11] 1108.63QPS avg.etl/infer/total:  3.89ms  37.53ms  42.71ms top.etl/infer/total:  9.00ms  69.00ms  77.00ms
[Q0000042000][T26] 1108.04QPS avg.etl/infer/total:  4.26ms  22.60ms  28.14ms top.etl/infer/total: 10.00ms  33.00ms  40.00ms
[Q0000043000][T37] 1107.30QPS avg.etl/infer/total:  3.88ms  23.35ms  28.63ms top.etl/infer/total: 10.00ms  39.00ms  45.00ms
[Q0000044000][T19] 1109.48QPS avg.etl/infer/total:  4.11ms  23.69ms  29.10ms top.etl/infer/total: 10.00ms  35.00ms  42.00ms
[Q0000045000][T17] 1107.12QPS avg.etl/infer/total:  3.81ms  41.57ms  46.75ms top.etl/infer/total:  9.00ms  75.00ms  81.00ms
[Q0000046000][T27] 1108.08QPS avg.etl/infer/total:  4.35ms  24.15ms  29.93ms top.etl/infer/total: 14.00ms  32.00ms  43.00ms

HugeCTR model on a SageMaker endpoint with the Hierarchical Parameter Server

Hi all,

I'm looking to train and deploy a HugeCTR model. The model is shallow and mostly based on user-item embeddings; I have about 180K users and 8K items.
I'm not interested in updating these embeddings, so I don't need the Kafka component, but I am looking for the ability to swap embeddings between the GPU cache and the CPU cache to reduce latency. Is this part of the Hierarchical Parameter Server, or is it natively supported by any HugeCTR model?

How to put the entire embedding table into the GPU embedding cache?

  • image: merlin-inference:22.03
  • machine: V100
  • request: 10 million requests (replayed repeatedly)

We tested the HugeCTR backend by replaying the 10 million requests repeatedly (epoch 1, epoch 2, ... up to epoch 50). CPU usage keeps dropping across epochs, presumably because of the GPU embedding cache: as more embeddings are inserted into the GPU cache, CPU usage falls and RPS improves.
(screenshot omitted)

So we built a smaller embedding table that fits in the GPU cache. The V100 has 32 GB of GPU memory, so we made the embedding table about 19 GB. Since this fits entirely in the GPU cache, we expected a large CPU usage and RPS improvement already in the first epoch, but the result is the same.

In the first epoch, CPU usage is still high, around 70-80%. If the GPU embedding cache loaded the whole table at initialization time, CPU usage should be very low.

This is our ps.json. How can we load the entire embedding table into the GPU embedding cache? We already set gpucacheper to 1.0, but it does not help.

{
    "supportlonglong":"true",
    "volatile_db": {
            "type":"parallel_hash_map",
            "initial_cache_rate":1.0,
            "max_get_batch_size": 100000,
            "max_set_batch_size": 100000
    },
    "models":[
        {
            "model":"meb",
            "supportlonglong":true,
            "num_of_worker_buffer_in_pool":"8",
            "num_of_refresher_buffer_in_pool":"1",
            "deployed_device_list":[0, 1, 2, 3, 4, 5, 6, 7],
            "max_batch_size":100,
            "default_value_for_each_table":[0.0],
            "hit_rate_threshold":"1.1",
            "gpucacheper":"1.0",
            "gpucache":"true",
            "cache_refresh_percentage_per_iteration":0.0,
            "sparse_files":["/infer/models/video-meb-bmtest-v3/1/meb0_sparse_0.model"],
            "dense_file":"/infer/models/video-meb-bmtest-v3/1/meb_dense_0.model",
            "network_file":"/infer/models/video-meb-bmtest-v3/1/meb.json"
        }
    ]
}
