
hugectr_backend's Introduction

License

Hierarchical Parameter Server Backend

The Hierarchical Parameter Server (HPS) Backend is a framework for looking up embedding vectors in large-scale embedding tables. It is designed to use GPU memory effectively to accelerate lookups by decoupling the embedding tables and the embedding cache from the end-to-end inference pipeline of the deep recommendation model. The HPS Backend supports executing multiple embedding lookup services concurrently across multiple GPUs through an embedding cache that is shared between multiple lookup sessions. For more information, see Hierarchical Parameter Server Architecture.

Quick Start

You can build the HPS Backend from scratch and install it to a path of your choice based on your specific requirements using the NGC Merlin inference Docker images.

We support the following compute capabilities for inference deployment:

Compute Capability GPU SM
7.0 NVIDIA V100 (Volta) 70
7.5 NVIDIA T4 (Turing) 75
8.0 NVIDIA A100 (Ampere) 80
8.6 NVIDIA A10 (Ampere) 86
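
If you are not sure which compute capability your GPUs provide, recent NVIDIA drivers can report it directly through nvidia-smi. The following query is a minimal sketch and assumes a driver whose nvidia-smi supports the compute_cap query field:

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# Example output (values depend on your hardware):
# name, compute_cap
# Tesla V100-SXM2-16GB, 7.0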

The following prerequisites must be met before installing or building the HugeCTR Backend from scratch:

  • Docker version 19 and higher
  • cuBLAS version 10.1
  • CMake version 3.17.0
  • cuDNN version 7.5
  • RMM version 0.16
  • GCC version 7.4.0

Install the HPS Backend Using NGC Containers

All NVIDIA Merlin components are available as open-source projects. However, a more convenient way to make use of these components is by using Merlin NGC containers. These NGC containers allow you to package your software application, libraries, dependencies, and runtime compilers in a self-contained environment. When you install the HPS Backend using NGC containers, the application environment remains portable, consistent, reproducible, and agnostic to the underlying host system software configuration. The HPS Backend container has the necessary libraries and header files pre-installed, and you can directly deploy the HPS models to production.

Docker images for the HPS Backend are available in the NVIDIA container repository at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-hugectr. You can pull and launch the container by running the following command:

docker run --gpus=1 --rm -it nvcr.io/nvidia/merlin/merlin-hugectr:23.09 # Start in interactive mode

NOTE: The HPS backend is derived from the HugeCTR backend. As of HugeCTR version 3.0, the HugeCTR container is no longer released separately. If you're an advanced user, you can use the unified Merlin container to build the HugeCTR training or inference Docker image from scratch based on your own specific requirements. You can obtain the unified Merlin container by logging into NGC.

Build the HPS Backend from Scratch

Before building the HPS inference backend from scratch, you must first verify that the HugeCTR inference shared library (libhuge_ctr_inference.so) has been compiled. Then you can generate the HPS shared library (libtriton_hps.so) and copy it to the HugeCTR/HPS default path to complete the backend build. The default path where all HugeCTR and HPS Backend libraries and header files are installed is /usr/local/hugectr.

  1. Build the HugeCTR inference shared library from scratch. Download the HugeCTR repository and the third-party modules that it relies on by running the following commands:

    $ git clone https://github.com/NVIDIA/HugeCTR.git
    $ cd HugeCTR
    $ git submodule update --init --recursive
    

    Build the HugeCTR inference shared library:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_BUILD_TYPE=Release -DSM="70;80" -DENABLE_INFERENCE=ON ..
    $ make -j && make install
    

    For more information, see Build HPS from Source. After compiling, you can find the libhuge_ctr_hps.so file in the path /usr/local/hugectr/lib.

  2. Build the HPS inference backend. Download the HPS Backend repository by running the following commands:

    $ git clone https://github.com/triton-inference-server/hugectr_backend.git
    $ cd hugectr_backend/hps_backend
    

    Use CMake to build and install the HPS Backend as follows:

    $ mkdir -p build && cd build
    $ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_COMMON_REPO_TAG=<rxx.yy>  -DTRITON_CORE_REPO_TAG=<rxx.yy> -DTRITON_BACKEND_REPO_TAG=<rxx.yy> ..
    $ make install
    $ ls             # check your compiled shared library (libtriton_hps.so)
    

    NOTE: <rxx.yy> is the "release version" of Triton that you want to deploy, such as r23.06. You can use the tritonserver command to confirm your current "server_version" and then find the corresponding "release version" in the Triton release notes. For example, the r23.06 release corresponds to Triton "server_version" 2.35.0.

    Option Value
    server_id triton
    server_version 2.35.0
    release version r23.06
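
    For example, one way to confirm the "server_version" of a running Triton instance is to query its server metadata endpoint. This is a minimal sketch that assumes Triton's HTTP service is listening on the default port 8000:

    $ curl -s localhost:8000/v2
    # The JSON response includes "name", "version" (the server_version, e.g. "2.35.0"),
    # and the list of supported "extensions".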
  3. Copy the compiled shared library (libtriton_hps.so) to your specified HPS default path. Remember to specify the absolute path of the local directory where the HPS Backend is installed via the --backend-directory argument when launching the Triton server. For example, if you copy it to the /usr/local/hugectr/backends/hps folder, the sample command to start tritonserver would be:

    $ tritonserver --model-repository=/path/to/model_repo/ --load-model=model_name \
     --model-control-mode=explicit \
     --backend-directory=/usr/local/hugectr/backends \
     --backend-config=hps,ps=/path/to/model_repo/hps.json
    

    The following Triton repositories, which are required, will be pulled and used in the build. By default, the "main" branch/tag will be used for each repository. However, the following cmake arguments can be used to override the "main" branch/tag:

    • triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag]
    • triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]
    • triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag]

    For more information, see Triton example backends and Triton backend shared library.

Independent Inference Hierarchical Parameter Server Configuration

The HPS Backend configuration file is essentially the same as the HugeCTR inference Parameter Server configuration format, with some new configuration items added for the HPS Backend. In particular, the per-model configuration of multiple embedding tables avoids passing too many command-line parameters and allows reasonable memory pre-allocation when launching the Triton server.

To deploy an embedding table on the HPS Backend, a few customized configuration items need to be added as follows. The HPS Backend configuration file must be formatted as JSON.

NOTE: The models clause needs to be included as a list, with the specific configuration of each model as an item. sparse_files can be filled with multiple embedding table paths to support multiple embedding tables per model.

{
    "supportlonglong": true,
    "volatile_db": {
        "type": "hash_map",
        "user_name": "default",
        "num_partitions": 8,
        "max_batch_size": 100000,
        "overflow_policy": "evict_random",
        "overflow_margin": 10000000,
        "overflow_resolution_target": 0.8,
        "initial_cache_rate": 1.0
    },
    "persistent_db": {
        "type": "disabled"
    },
    "models": [{
        "model": "hps_wdl",
        "sparse_files": ["/hps_infer/embedding/hps_wdl/1/wdl0_sparse_2000.model", "/hps_infer/embedding/hps_wdl/1/wdl1_sparse_2000.model"],
        "num_of_worker_buffer_in_pool": 3,
        "embedding_table_names":["embedding_table1","embedding_table2"],
        "embedding_vecsize_per_table":[1,16],
        "maxnum_catfeature_query_per_table_per_sample":[2,26],
        "default_value_for_each_table":[0.0,0.0],
        "deployed_device_list":[0],
        "max_batch_size":1024,
        "hit_rate_threshold":0.9,
        "gpucacheper":0.5,
        "gpucache":true
        }
    ]
}

Model Repository Extension

Since the HPS Backend is a customizable Triton component, it is capable of supporting the Model Repository Extension. Triton's Model Repository Extension allows you to query and control model repositories being served by Triton. The "model_repository" extension is reported in the extensions field of the server metadata. For more information, see Model Repository Extension.

The HPS Backend is fully compatible with Triton's EXPLICIT model control mode. After adding the configuration of a new model to the HPS configuration file, new models can be deployed online through Triton's load API, and old models can be recycled online through the unload API.
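
When Triton runs in EXPLICIT model control mode, these load and unload operations can be triggered through the model repository endpoints of Triton's HTTP API. The commands below are a minimal sketch; hps_wdl is the example model name from the configuration above, and the default HTTP port 8000 is assumed:

curl -X POST localhost:8000/v2/repository/models/hps_wdl/load     # deploy a new model online
curl -X POST localhost:8000/v2/repository/models/hps_wdl/unload   # recycle an old model online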

The following should be noted when using Model Repository Extension functions:

  • Deploy new models online: The load API not only loads the network dense weights as part of the HugeCTR model, but also inserts the embedding tables of the new models into the Hierarchical Inference Parameter Server and creates the embedding caches based on the model definition in the independent Parameter Server configuration. In other words, the Parameter Server independently provides an initialization mechanism for the embedding tables and embedding caches of new models.

Note: If you use the HPS inference online update, add the freeze_sparse option (default: false) in the Triton configuration file (config.pbtxt) to avoid the embedding table being updated repeatedly.

parameters:[
   ...
 {
 key: "freeze_sparse"
 value: { string_value: "true" }
 }
   ...
]

Metrics

Triton provides Prometheus metrics indicating GPU and request statistics. Use Prometheus to gather metrics into usable, actionable entries, giving you the data you need to manage alerts and performance information in your environment. Prometheus is usually used alongside Grafana, a visualization tool that pulls Prometheus metrics and makes them easier to monitor. You can build your own metrics system based on our example; see HPS Backend Metrics.
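
For a quick look at the raw data before wiring Prometheus and Grafana together, you can scrape Triton's metrics endpoint directly. This is a minimal sketch that assumes the default metrics port 8002 has not been changed:

curl -s localhost:8002/metrics
# Returns Prometheus-format counters and gauges (for example, inference request counts
# and GPU utilization) that a Prometheus server can scrape on a fixed interval.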

HPS Inference Hierarchical Parameter Server

The HPS inference Hierarchical Parameter Server implements a hierarchical storage mechanism between local SSDs and CPU memory, which breaks the convention that the embedding table must be stored entirely in local CPU memory. The volatile database layer allows utilizing Redis cluster deployments to store and retrieve embeddings in/from the RAM available in your cluster. The persistent database layer links HPS with a persistent database: each node that has such a persistent storage layer configured retains a separate copy of all embeddings in its locally available non-volatile memory. See Distributed Deployment and Hierarchical Parameter Server for more details.
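
As an illustration of such a layered deployment, the storage layers are selected through the volatile_db and persistent_db clauses of the HPS configuration file. The fragment below is a hedged sketch only: the Redis addresses, the RocksDB path, and the tuning values are placeholders, and the exact option names (for example the persistent database type string) should be verified against the Hierarchical Parameter Server documentation for your HugeCTR version.

{
    "supportlonglong": true,
    "volatile_db": {
        "type": "redis_cluster",
        "address": "127.0.0.1:7000,127.0.0.1:7001,127.0.0.1:7002",
        "user_name": "default",
        "num_partitions": 8,
        "overflow_margin": 10000000,
        "overflow_policy": "evict_oldest",
        "initial_cache_rate": 1.0
    },
    "persistent_db": {
        "type": "rocks_db",
        "path": "/hps/rocksdb"
    },
    "models": [
        ...
    ]
}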

In the following table, we provide an overview of the typical properties of the different parameter database layers (and the embedding cache). We emphasize that this table is only intended to provide a rough orientation; properties of actual deployments may deviate.

| | GPU Embedding Cache | CPU Memory Database | Distributed Database (InfiniBand) | Distributed Database (Ethernet) | Persistent Database |
|---|---|---|---|---|---|
| Mean Latency | ns ~ us | us ~ ms | us ~ ms | several ms | ms ~ s |
| Capacity (relative) | ++ | +++ | +++++ | +++++ | +++++++ |
| Capacity (range in practice) | 10 GBs ~ few TBs | 100 GBs ~ several TBs | several TBs | several TBs | up to 100s of TBs |
| Cost / Capacity | ++++ | +++ | ++++ | ++++ | + |
| Volatile | yes | yes | configuration dependent | configuration dependent | no |
| Configuration / maintenance complexity | low | low | high | high | low |
  • Embedding Cache Asynchronous Refresh Mechanism

We support asynchronous refreshing of incremental embedding keys into the embedding cache. The refresh operation is triggered when the sparse model files need to be updated into the GPU embedding cache. After completing a model version iteration or an incremental parameter update based on online training, the latest embedding table needs to be updated into the embedding cache on the inference server. To ensure that a running model can be updated online, we update the distributed database and persistent database through the distributed event streaming platform (Kafka). At the same time, the GPU embedding cache refreshes the values of the existing embedding keys and replaces them with the latest incremental embedding vectors.

  • Embedding Cache Asynchronous Insertion Mechanism

We support asynchronous insertion of missing embedding keys into the embedding cache. This feature is activated automatically through a user-defined hit rate threshold in the configuration file. When the actual hit rate of the embedding cache is higher than the user-defined threshold, the embedding cache inserts missing keys asynchronously; otherwise, they are still inserted synchronously to ensure high accuracy of inference requests. Compared with the previous synchronous method, asynchronous insertion can further improve the actual hit rate of the embedding cache once it reaches the user-defined threshold.

  • Performance Optimization of Inference Parameter Server

We have added support for multiple database interfaces to our inference parameter server. In particular, we added an "in memory" database that utilizes local CPU memory for storing and recalling embeddings and uses multi-threading to accelerate lookup and storage.
Further, we revised support for "distributed" storage of embeddings in a Redis cluster. This way, you can use the combined CPU-accessible memory of your cluster for storing embeddings. The new implementation is up to two orders of magnitude faster than the previous one.
Further, we performance-optimized support for "persistent" storage and retrieval of embeddings via RocksDB through the structured use of column families. Creating a hierarchical storage (i.e., using Redis as a distributed cache and RocksDB as a fallback) is supported as well. These advantages come to end users for free, as there is no need to adjust the PS configuration.

Hierarchical Parameter Server Online Update

If an incremental update has been applied to some embedding table entries, either during online training (frequent/incremental updates) or after completing an offline training, the latest versions of the updated embeddings have to be propagated to all inference nodes. Our HPS achieves this using a dedicated online updating mechanism. The blue data-flow graph in the figure below illustrates this process. First, the training nodes dump their updates to an Apache Kafka-based message buffer. This is done via our Message Producer API, which handles serialization, batching, and the organization of updates into distinct message queues for each embedding table. Inference nodes that have loaded the affected model can use the corresponding Message Source API to discover and subscribe to these message queues. Received updates are then applied to the respective local VDB shards and the PDB. The GPU embedding cache polls its associated VDB/PDB for updates and replaces embeddings if necessary. This refresh cycle is configurable to best fit the training schedule. When using online training, the GPU embedding cache periodically (e.g., every $n$ minutes or hours) scans for updates and refreshes its contents. During offline training, poll cycles are triggered by the Triton model management API.

Fig. 1. HPS Inference Online Update
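
On the inference side, HPS discovers these Kafka message queues through the parameter server configuration. The fragment below is a hedged sketch: the update_source clause and the broker addresses are illustrative placeholders, so verify the exact option names against the Hierarchical Parameter Server documentation for your version.

{
    "supportlonglong": true,
    "update_source": {
        "type": "kafka_message_queue",
        "brokers": "kafka-node1:9092,kafka-node2:9092"
    },
    ...
}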

hugectr_backend's People

Contributors

bashimao · emmaqiaoch · guanluo · jershi425 · kingsleyliu-nv · shijieliu · yingcanw · zehuanw


hugectr_backend's Issues

[BUG] support google cloud storage path in hugeCTR backend

When serving models in the cloud, users typically leverage a cloud storage path like gs://GCS_bucket/models/deepfm/1/deepfm0_sparse_0.model

For example, the config.pbtxt will contain

parameters {
  key: "config"
  value {
    string_value: "gs://[bucket_name]/models/deepfm/1/deepfm.json"
  }
}

The current hugeCTR backend reports: I1030 21:59:03.044715 167 hugectr.cc:282] Fail to open Parameter Server Configuration, please check whether the file path is correct

[BUG] HugeCTR backend 3.4.1 has crashed

The Triton server with the hugectr_backend suddenly died with a segmentation fault after 45 hours while processing repeated mild requests. CPU usage was only 18.6% and memory usage was only 18.8%. There was no CPU usage spike during normal request processing, but there was one spike before the server died, leaving a large 75 GB coredump file.

It repeatedly sent 500,000 benchmark records and used only 10 request threads.

Screenshot 2022-04-29 10:43:38 AM

Screenshot 2022-04-29 10:43:49 AM

top - 19:29:02 up 184 days, 23:44,  0 users,  load average: 11.37, 10.93, 10.87
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s): 16.4 us,  3.1 sy,  0.0 ni, 80.3 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem : 386626.8 total, 242976.1 free,  46327.4 used,  97323.2 buff/cache
MiB Swap:   4096.2 total,   1095.7 free,   3000.6 used. 302357.5 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 14365 root      20   0  352.3g  70.9g  33.2g S 312.7  18.8   6139:48 tritonserver
     1 root      20   0    5672    216      0 S   0.0   0.0   0:00.08 bash
  5409 root      20   0    5676   2580   1932 S   0.0   0.0   0:00.06 bash
 14357 root      20   0    5460      0      0 S   0.0   0.0   0:00.00 run_server.sh
 33886 root      20   0    7564   3664   3108 R   0.0   0.0   0:00.00 top
  • version & server info
  • server : V100 8-GPU server with 384GB RAM
  • version
    • container : nvcr.io/nvidia/merlin/merlin-inference:22.03
    • hugectr_backend : v3.4.1
Signal (11) received.
 0# 0x000055B2063FC299 in tritonserver
 1# 0x00007F07EBCC20C0 in /lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F07EBE0A96A in /lib/x86_64-linux-gnu/libc.so.6
 3# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
 4# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
 5# 0x00007F07E044929D in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 6# TRITONBACKEND_ModelInstanceExecute in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 7# 0x00007F07EC866F9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007F07EC8676B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007F07EC6FF1A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F07EC861527 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F07EC0B3DE4 in /lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F07EC530609 in /lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /lib/x86_64-linux-gnu/libc.so.6

./run_server.sh: line 11: 14365 Segmentation fault      (core dumped) tritonserver --model-repository=/naver/models/ --load-model=meb --model-control-mode=explicit --backend-directory=/usr/local/hugectr/backends --backend-config=hugectr,ps=/naver/models/meb/ps.json --log-info=false --log-verbose=0

server config

  • ps.json
{
    "supportlonglong":"true",
    "volatile_db": {
        "type":"parallel_hash_map",
        "initial_cache_rate":1.0,
        "overflow_margin":120000000,
        "max_get_batch_size": 100000,
        "max_set_batch_size": 100000
    },
    "models":[
        {
            "model":"meb",
            "supportlonglong":true,
            "num_of_worker_buffer_in_pool":"4",
                "num_of_refresher_buffer_in_pool":"1",
                "deployed_device_list":[0, 1, 2, 3, 4, 5, 6, 7],
                "max_batch_size":100,
                "default_value_for_each_table":[0.0],
                "hit_rate_threshold":"0.8",
                "gpucacheper":"1.0",
                "gpucache":"true",
                "cache_refresh_percentage_per_iteration":0.2,
                    "sparse_files":["/naver/models/video-meb-bmtest-v3/1/meb0_sparse_0.model"],
            "dense_file":"/naver/models/video-meb-bmtest-v3/1/meb_dense_0.model",
            "network_file":"/naver/models/video-meb-bmtest-v3/1/meb.json"
        }
    ]
}
  • config.pbtxt
name: "meb"
backend: "hugectr"
max_batch_size:100,
input [
   {
    name: "DES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "CATCOLUMN"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "ROWINDEX"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 4
    kind : KIND_GPU
  }
]

parameters [
  {
  key: "config"
  value: { string_value: "/naver/models/video-meb-bmtest-v3/1/meb.json" }
  },
  {
  key: "gpucache"
  value: { string_value: "true" }
  },
  {
  key: "hit_rate_threshold"
  value: { string_value: "0.8" }
  },
  {
  key: "gpucacheper"
  value: { string_value: "1.0" }
  },
  {
  key: "label_dim"
  value: { string_value: "1" }
  },
  {
  key: "slots"
  value: { string_value: "15" }
  },
  {
  key: "cat_feature_num"
  value: { string_value: "8000" }
  },
  {
  key: "des_feature_num"
  value: { string_value: "0" }
  },
  {
  key: "max_nnz"
  value: { string_value: "8000" }
  },
  {
  key: "embedding_vector_size"
  value: { string_value: "15" }
  },
  {
  key: "embeddingkey_long_type"
  value: { string_value: "true" }
  }
]

Inference error when using the Triton HTTP client

I1125 01:05:26.620924 657 hugectr.cc:1107] The model origin json configuration file path is: /data/zxt/t3/dcn/t3.json
[HUGECTR][01:05:26][INFO][RANK0]: Global seed is 2802872729
[HUGECTR][01:05:27][WARNING][RANK0]: Peer-to-peer access cannot be fully enabled.
[HUGECTR][01:05:27][INFO][RANK0]: Start all2all warmup
[HUGECTR][01:05:27][INFO][RANK0]: End all2all warmup
[HUGECTR][01:05:27][INFO][RANK0]: Use mixed precision: 0
[HUGECTR][01:05:27][INFO][RANK0]: start create embedding for inference
[HUGECTR][01:05:27][INFO][RANK0]: sparse_input name data1
[HUGECTR][01:05:27][INFO][RANK0]: create embedding for inference success
[HUGECTR][01:05:27][INFO][RANK0]: Inference stage skip BinaryCrossEntropyLoss layer, replaced by Sigmoid layer
I1125 01:05:38.177670 657 hugectr.cc:1110] ******Loading HugeCTR model successfully
I1125 01:05:38.177843 657 model_repository_manager.cc:1183] successfully loaded 'dcn_t3' version 1
I1125 01:05:42.905847 657 model_repository_manager.cc:1022] loading: dcn_t3_ens:1
I1125 01:05:43.006289 657 model_repository_manager.cc:1183] successfully loaded 'dcn_t3_ens' version 1
1125 01:07:03.109141 684 pb_stub.cc:402] Failed to process the request(s) for model 'dcn_t3_nvt_0', message: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

At:
/root/.local/lib/python3.8/site-packages/pandas-1.3.4-py3.8-linux-x86_64.egg/pandas/core/generic.py(1537): nonzero
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(266): _transform_tensors
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(265): _transform_tensors
/data/zxt/t3/model/dcn_t3_nvt/1/model.py(143): execute

1125 01:10:03.070768 684 pb_stub.cc:402] Failed to process the request(s) for model 'dcn_t3_nvt_0', message: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

At:
/root/.local/lib/python3.8/site-packages/pandas-1.3.4-py3.8-linux-x86_64.egg/pandas/core/generic.py(1537): nonzero

[BUG] big CATCOLUMN, ROWINDEX server crash

Hello. I am using merlin-inference:22.03.

When I test the hugectr backend with a small batch size, the CATCOLUMN/ROWINDEX requests work well, but when I use a larger batch size such as 50~100, the server aborts with signal (6) or signal (11).

I0321 08:59:58.537734 1 infer_request.cc:675] prepared: [0x0x7fcf600059e0] request id: 1, model: meb, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fcf60006d28] input: ROWINDEX, type: INT32, original shape: [1,1351], batch + shape: [1,1351], shape: [1351]
[0x0x7fcf60007448] input: CATCOLUMN, type: INT64, original shape: [1,1916], batch + shape: [1,1916], shape: [1916]
[0x0x7fcf60007b78] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
override inputs:
inputs:
[0x0x7fcf60007b78] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
[0x0x7fcf60007448] input: CATCOLUMN, type: INT64, original shape: [1,1916], batch + shape: [1,1916], shape: [1916]
[0x0x7fcf60006d28] input: ROWINDEX, type: INT32, original shape: [1,1351], batch + shape: [1,1351], shape: [1351]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I0321 08:59:58.537824 1 hugectr.cc:1988] model meb, instance meb, executing 1 requests
I0321 08:59:58.537856 1 hugectr.cc:2056] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I0321 08:59:58.537878 1 hugectr.cc:2157]        input CATCOLUMN: datatype = INT64, shape = [1,1916], byte_size = 15328, buffer_count = 1
I0321 08:59:58.537888 1 hugectr.cc:2169]        input ROWINDEX: datatype = INT32, shape = [1,1351], byte_size = 5404, buffer_count = 1
I0321 08:59:58.537896 1 hugectr.cc:2181]        input DES: datatype = FP32, shape = [1,0], byte_size = 0, buffer_count = 0
I0321 08:59:58.537904 1 hugectr.cc:2206]        requested_output OUTPUT0
I0321 08:59:58.537912 1 infer_response.cc:166] add response output: output: OUTPUT0, type: FP32, shape: [90]
I0321 08:59:58.537925 1 http_server.cc:1068] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I0321 08:59:58.537939 1 http_server.cc:1088] HTTP using buffer for: 'OUTPUT0', size: 360, addr: 0x7f7fcc2c5020
I0321 08:59:58.538025 1 hugectr.cc:2372] *****Processing request on device***** 0 for model meb
Signal (11) received.
 0# 0x000055A64FB47299 in tritonserver
 1# 0x00007FE4C8D42210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007FE4C8E8A959 in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
 4# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
 5# 0x00007FE4AC4C629D in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 6# TRITONBACKEND_ModelInstanceExecute in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 7# 0x00007FE4C98E4F9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x00007FE4C98E56B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007FE4C977D1A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007FE4C98DF527 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007FE4C9130DE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007FE4C95AE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
[0x0x7f32f000b348] input: DES, type: FP32, original shape: [1,0], batch + shape: [1,0], shape: [0]
[0x0x7f32f0023ab8] input: CATCOLUMN, type: INT64, original shape: [1,2949], batch + shape: [1,2949], shape: [2949]
[0x0x7f32f0023c38] input: ROWINDEX, type: INT32, original shape: [1,751], batch + shape: [1,751], shape: [751]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I0324 04:46:46.684514 1 hugectr.cc:1988] model meb, instance meb, executing 1 requests
I0324 04:46:46.684549 1 hugectr.cc:2056] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I0324 04:46:46.684573 1 hugectr.cc:2157]        input CATCOLUMN: datatype = INT64, shape = [1,2949], byte_size = 23592, buffer_count = 1
I0324 04:46:46.684581 1 hugectr.cc:2169]        input ROWINDEX: datatype = INT32, shape = [1,751], byte_size = 3004, buffer_count = 1
I0324 04:46:46.684588 1 hugectr.cc:2181]        input DES: datatype = FP32, shape = [1,0], byte_size = 0, buffer_count = 0
I0324 04:46:46.684595 1 hugectr.cc:2206]        requested_output OUTPUT0
I0324 04:46:46.684605 1 infer_response.cc:166] add response output: output: OUTPUT0, type: FP32, shape: [50]
I0324 04:46:46.684620 1 http_server.cc:1068] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I0324 04:46:46.684628 1 http_server.cc:1088] HTTP using buffer for: 'OUTPUT0', size: 200, addr: 0x7f4f14002bf0
terminate called after throwing an instance of 'std::runtime_error'
  what():  Runtime error: invalid argument /repos/hugectr_inference_backend/src/hugectr.cc:2343

Signal (6) received.
 0# 0x000055ECA62F0299 in tritonserver
 1# 0x00007F568119D210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F5681553911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F568155F38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F568155F3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F568155F6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007F56760AFA9E in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 9# 0x00007F5681D3FF9A in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F5681D406B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F5681BD81A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
12# 0x00007F5681D3A527 in /opt/tritonserver/bin/../lib/libtritonserver.so
13# 0x00007F568158BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
14# 0x00007F5681A09609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
15# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Signal (11) received.
 0# 0x000055ECA62F0299 in tritonserver
 1# 0x00007F568119D210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00007F5681553911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 4# 0x00007F568155F38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F568155F3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F568155F6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F56760AFA9E in /usr/local/hugectr/backends/hugectr/libtriton_hugectr.so
 8# 0x00007F5681D3FF9A in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x00007F5681D406B7 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F5681BD81A1 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F5681D3A527 in /opt/tritonserver/bin/../lib/libtritonserver.so
12# 0x00007F568158BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
13# 0x00007F5681A09609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
14# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

[BUG] HugeCTR backend crashed when I make pressure test with 100 concurrents,and the coredump file not generated.

The HugeCTR backend crashed when I ran a pressure test with 100 concurrent requests, and no coredump file was generated.

I1215 03:27:54.113361 33740 hugectr.cc:1438] model h3_hugectr, instance h3_hugectr, executing 1 requests
I1215 03:27:54.113399 33740 hugectr.cc:1512] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I1215 03:27:54.113373 33740 http_server.cc:2727] HTTP request: 2 /v2/models/h3_hugectr/infer
I1215 03:27:54.113420 33740 hugectr.cc:1610]    input CATCOLUMN: datatype = INT64, shape = [1,576], byte_size = 4608, buffer_count = 1
I1215 03:27:54.113434 33740 model_repository_manager.cc:615] GetInferenceBackend() 'h3_hugectr' version -1
I1215 03:27:54.113466 33740 model_repository_manager.cc:615] GetInferenceBackend() 'h3_hugectr' version -1
I1215 03:27:54.113455 33740 hugectr.cc:1626]    input ROWINDEX: datatype = INT32, shape = [1,577], byte_size = 2308, buffer_count = 1
I1215 03:27:54.113478 33740 hugectr.cc:1640]    input DES: datatype = FP32, shape = [1,704], byte_size = 2816, buffer_count = 1
I1215 03:27:54.113484 33740 hugectr.cc:1665]    requested_output OUTPUT0
I1215 03:27:54.113495 33740 infer_response.cc:165] add response output: output: OUTPUT0, type: FP32, shape: [32]
I1215 03:27:54.113493 33740 infer_request.cc:524] prepared: [0x0x7f1f0800d230] request id: 1, model: h3_hugectr, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f1f0800aa88] input: ROWINDEX, type: INT32, original shape: [1,577], batch + shape: [1,577], shape: [577]
[0x0x7f1f0800edc8] input: CATCOLUMN, type: INT64, original shape: [1,576], batch + shape: [1,576], shape: [576]
[0x0x7f1f0800b7d8] input: DES, type: FP32, original shape: [1,704], batch + shape: [1,704], shape: [704]
override inputs:
inputs:
[0x0x7f1f0800b7d8] input: DES, type: FP32, original shape: [1,704], batch + shape: [1,704], shape: [704]
[0x0x7f1f0800edc8] input: CATCOLUMN, type: INT64, original shape: [1,576], batch + shape: [1,576], shape: [576]
[0x0x7f1f0800aa88] input: ROWINDEX, type: INT32, original shape: [1,577], batch + shape: [1,577], shape: [577]
original requested outputs:
OUTPUT0
requested outputs:
OUTPUT0

I1215 03:27:54.113504 33740 http_server.cc:1051] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I1215 03:27:54.113543 33740 http_server.cc:1071] HTTP using buffer for: 'OUTPUT0', size: 128, addr: 0x7f1f44bca620
I1215 03:27:54.113581 33740 hugectr.cc:1438] model h3_hugectr, instance h3_hugectr, executing 1 requests
I1215 03:27:54.113599 33740 hugectr.cc:1512] request 0: id = "1", correlation_id = 0, input_count = 3, requested_output_count = 1
I1215 03:27:54.113615 33740 hugectr.cc:1610]    input CATCOLUMN: datatype = INT64, shape = [1,576], byte_size = 4608, buffer_count = 1
I1215 03:27:54.113629 33740 hugectr.cc:1626]    input ROWINDEX: datatype = INT32, shape = [1,577], byte_size = 2308, buffer_count = 1
I1215 03:27:54.113640 33740 hugectr.cc:1640]    input DES: datatype = FP32, shape = [1,704], byte_size = 2816, buffer_count = 1
I1215 03:27:54.113646 33740 hugectr.cc:1665]    requested_output OUTPUT0
I1215 03:27:54.113651 33740 infer_response.cc:165] add response output: output: OUTPUT0, type: FP32, shape: [32]
I1215 03:27:54.113663 33740 http_server.cc:1051] HTTP: unable to provide 'OUTPUT0' in GPU, will use CPU
I1215 03:27:54.113674 33740 http_server.cc:1071] HTTP using buffer for: 'OUTPUT0', size: 128, addr: 0x7f1fdcbc3b60
I1215 03:27:54.115386 33740 hugectr.cc:1815] *****Processing request on device***** 0 for model h3_hugectr
I1215 03:27:54.120122 33740 hugectr.cc:1815] *****Processing request on device***** 0 for model h3_hugectr
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
Signal (6) received.
terminate called recursively
Signal (6) received.
 0# 0x0000555E676798A9 in tritonserver
 1# 0x00007F22576FF210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F2257AB5911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F2257AC138C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F2257AC13F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F2257AC16A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# std::__throw_system_error(int) in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 9# 0x00007F2257AEE0BD in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
10# HugeCTR::embedding_cache<long long>::look_up(void const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, float*, HugeCTR::MemoryBlock*, std::vector<CUstream_st*, std::allocator<CUstream_st*> > const&, float) in /usr/local/hugectr/lib/libhugectr_inference.so
11# HugeCTR::InferenceSession::predict(float*, void*, int*, float*, int) in /usr/local/hugectr/lib/libhugectr_inference.so
12# 0x00007F22383F404A in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
13# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
14# 0x00007F225828B83A in /opt/tritonserver/bin/../lib/libtritonserver.so
15# 0x00007F225828C04D in /opt/tritonserver/bin/../lib/libtritonserver.so
16# 0x00007F2258140801 in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x00007F2258285DC7 in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x00007F2257AEDDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
19# 0x00007F2257F6B609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
20# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Aborted (core dumped)
root@13609cdec7a0:/opt/tritonserver# ll
total 3612
drwxr-xr-x 1 root          root             4096 Nov 29 08:07  ./
drwxr-xr-x 1 root          root             4096 Oct 21 18:44  ../
-rw-r--r-- 1 root          root              641 Nov  4 19:20 '=1.21.6'
-rw-rw-r-- 1 triton-server triton-server    1485 Oct 21 17:20  LICENSE
-rw-rw-r-- 1 triton-server triton-server 3012640 Oct 21 17:20  NVIDIA_Deep_Learning_Container_License.pdf
-rw-rw-r-- 1 triton-server triton-server       7 Oct 21 17:20  TRITON_VERSION
drwxr-xr-x 1 triton-server triton-server    4096 Nov  4 19:33  backends/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  bin/
drwxr-xr-x 1 triton-server triton-server    4096 Oct 21 18:45  include/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  lib/
-rw------- 1 root          root           616982 Dec 10 02:29  nohup.out
-rwxrwxr-x 1 triton-server triton-server    4090 Oct 21 17:20  nvidia_entrypoint.sh*
drwxr-xr-x 3 triton-server triton-server    4096 Oct 21 18:48  repoagents/
drwxr-xr-x 2 triton-server triton-server    4096 Oct 21 18:45  third-party-src/
root@13609cdec7a0:/opt/tritonserver# ll /opt/tritonserver/backends/hugectr/
total 400
drwxr-xr-x 2 root root   4096 Nov  4 19:33 ./
drwxr-xr-x 3 root root   4096 Nov  4 19:33 ../
-rw-r--r-- 1 root root 398904 Nov  4 19:33 libtriton_hugectr.so


my start command:

 tritonserver --model-repository=/data/gux/h3/h3_hugectr/ --backend-config=hugectr,ps=/data/gux/h3/h3_hugectr/ps.json --load-model h3_hugectr --model-control-mode=explicit --log-verbose 1 

#pressure test  command by siege

$ siege -H "Content-Type:application/json" "http://127.0.0.1:8088/xxxx/v1/api  POST < /home/t3mgr/data.json" -b -c 100  -t 5M

I1215 03:44:22.466890 66071 hugectr.cc:748] The model configuration:
{
"name": "h3_hugectr",
"platform": "",
"backend": "hugectr",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 4096,
"input": [
{
"name": "DES",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
},
{
"name": "CATCOLUMN",
"data_type": "TYPE_INT64",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
},
{
"name": "ROWINDEX",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "h3_hugectr_0",
"kind": "KIND_GPU",
"count": 3,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"cat_feature_num": {
"string_value": "18"
},
"config": {
"string_value": "/data/gux/h3/1/dcn.json"
},
"label_dim": {
"string_value": "1"
},
"max_nnz": {
"string_value": "1"
},
"embedding_vector_size": {
"string_value": "128"
},
"gpucacheper": {
"string_value": "0.5"
},
"des_feature_num": {
"string_value": "22"
},
"gpucache": {
"string_value": "true"
},
"embeddingkey_long_type": {
"string_value": "true"
},
"slots": {
"string_value": "18"
}
},
"model_warmup": []
}

Unable to launch Triton Server with hps backend using latest HugeCTR and hugectr_backend repos

Description
I'm unable to install and run the Triton server using the HPS backend.

Triton Information
Triton v23.06

To Reproduce
Steps to reproduce the behavior.

I'm following steps (1) and (2) here (https://github.com/triton-inference-server/hugectr_backend) under the Build the HPS Backend from Scratch section. I follow all the steps exactly.

I'm doing all the steps in a container built from this image (https://github.com/NVIDIA-Merlin/Merlin/blob/main/docker/dockerfile.ctr).

After trying to launch Triton using tritonserver --model-repository=/opt/hugectr_testing/data/test_dask/output/model_inference --backend-config=hps,ps=/opt/hugectr_testing/data/test_dask/output/model_inference/ps.json, I get the following:

I1120 07:09:59.322961 19420 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f4fd6000000' with size 268435456
I1120 07:09:59.323460 19420 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1120 07:09:59.333655 19420 model_lifecycle.cc:462] loading: criteo:1
I1120 07:09:59.333798 19420 model_lifecycle.cc:462] loading: criteo_nvt:1
I1120 07:09:59.370486 19420 hps.cc:62] TRITONBACKEND_Initialize: hps
I1120 07:09:59.370512 19420 hps.cc:69] Triton TRITONBACKEND API version: 1.13
I1120 07:09:59.370519 19420 hps.cc:73] 'hps' TRITONBACKEND API version: 1.15
I1120 07:09:59.370536 19420 hps.cc:150] TRITONBACKEND_Backend Finalize: HPSBackend
E1120 07:09:59.370572 19420 model_lifecycle.cc:626] failed to load 'criteo' version 1: Unsupported: Triton backend API version does not support this backend
I1120 07:09:59.370602 19420 model_lifecycle.cc:753] failed to load 'criteo'
I1120 07:09:59.521670 19436 pb_stub.cc:255]  Failed to initialize Python stub for auto-complete: ModuleNotFoundError: No module named 'hugectr'

At:
  /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load

E1120 07:09:59.529317 19420 model_lifecycle.cc:626] failed to load 'criteo_nvt' version 1: Internal: ModuleNotFoundError: No module named 'hugectr'

At:
  /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load

I1120 07:09:59.529369 19420 model_lifecycle.cc:753] failed to load 'criteo_nvt'
E1120 07:09:59.529473 19420 model_repository_manager.cc:562] Invalid argument: ensemble 'criteo_ens' depends on 'criteo' which has no loaded version. Model 'criteo' loading failed with error: version 1 is at UNAVAILABLE state: Unsupported: Triton backend API version does not support this backend;
I1120 07:09:59.529552 19420 server.cc:603] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1120 07:09:59.529642 19420 server.cc:630] 
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                                        |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1120 07:09:59.529735 19420 server.cc:673] 
+------------+---------+-------------------------------------------------------------------------------------------------+
| Model      | Version | Status                                                                                          |
+------------+---------+-------------------------------------------------------------------------------------------------+
| criteo     | 1       | UNAVAILABLE: Unsupported: Triton backend API version does not support this backend              |
| criteo_nvt | 1       | UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'hugectr'                           |
|            |         |                                                                                                 |
|            |         | At:                                                                                             |
|            |         |   /opt/hugectr_testing/data/test_dask/output/model_inference/criteo_nvt/1/model.py(1): <module> |
|            |         |   <frozen importlib._bootstrap>(241): _call_with_frames_removed                                 |
|            |         |   <frozen importlib._bootstrap_external>(883): exec_module                                      |
|            |         |   <frozen importlib._bootstrap>(703): _load_unlocked                                            |
|            |         |   <frozen importlib._bootstrap>(1006): _find_and_load_unlocked                                  |
|            |         |   <frozen importlib._bootstrap>(1027): _find_and_load                                           |
+------------+---------+-------------------------------------------------------------------------------------------------+

I1120 07:09:59.575939 19420 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I1120 07:09:59.576319 19420 metrics.cc:701] Collecting CPU metrics
I1120 07:09:59.576525 19420 tritonserver.cc:2385] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                          |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                         |
| server_version                   | 2.35.0                                                                                                                                                                                                         |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace loggin |
|                                  | g                                                                                                                                                                                                              |
| model_repository_path[0]         | /opt/hugectr_testing/data/test_dask/output/model_inference                                                                                                                                                     |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                      |
| strict_model_config              | 0                                                                                                                                                                                                              |
| rate_limit                       | OFF                                                                                                                                                                                                            |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                      |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                       |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                            |
| strict_readiness                 | 1                                                                                                                                                                                                              |
| exit_timeout                     | 30                                                                                                                                                                                                             |
| cache_enabled                    | 0                                                                                                                                                                                                              |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1120 07:09:59.576581 19420 server.cc:304] Waiting for in-flight requests to complete.
I1120 07:09:59.576592 19420 server.cc:320] Timeout 30: Found 0 model versions that have in-flight inferences
I1120 07:09:59.576617 19420 server.cc:335] All models are stopped, unloading models
I1120 07:09:59.576634 19420 server.cc:342] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

I trained my model using this example notebook: https://github.com/NVIDIA-Merlin/Merlin/blob/main/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb

However, this notebook came out before the HugeCTR backend was merged with the HPS backend. As a result, I needed to manually change a line in my config.pbtxt to go from the hugectr to the hps backend => backend: "hugectr" to backend: "hps".

Expected behavior
First, when building the inference version, I expect hugectr's Python module to be installed, but it isn't. This is strange because when I turn off -DENABLE_INFERENCE=ON and install, import hugectr works.

Second, I expect the Triton server to start and accept requests.

[BUG] HugeCTR Backend server dies with SIGBUS during server startup (22.04)

I was running the server on merlin-inference:22.04 with the same model and settings as #40, which was able to receive queries, but it died with SIGBUS.

  • run.sh
docker run --rm --runtime=nvidia --net=host -e HUGECTR_LOG_LEVEL=0 -it \
    -v `pwd`:/models \
    nvcr.io/nvidia/merlin/merlin-inference:22.04 \
        tritonserver \
        --model-repository=/models/ \
        --load-model=meb \
        --model-control-mode=explicit \
        --backend-directory=/usr/local/hugectr/backends \
        --backend-config=hugectr,ps=/models/meb/ps.json \
        --log-info=true \
        --log-verbose=0
$ ./run.sh                                                                                                                                            [525/538]
Unable to find image 'nvcr.io/nvidia/merlin/merlin-inference:22.04' locally
22.04: Pulling from nvidia/merlin/merlin-inference
4d32b49e2995: Already exists
45893188359a: Pulling fs layer
5ad1f2004580: Pulling fs layer
6ddc1d0f9183: Pulling fs layer
4cc43a803109: Waiting
e94a4481e933: Waiting
3e7e4c9bc2b1: Waiting
9463aa3f5627: Waiting
a4a0c690bc7d: Waiting
59d451175f69: Waiting
eaf45e9f32d1: Waiting
d8d16d6af76d: Waiting
9e04bda98b05: Waiting
4f4fb700ef54: Pull complete
98e1b8b4cf4b: Pull complete
3ba4cd25cab4: Pull complete
e07a05c28244: Pull complete
6a99482f27f4: Pull complete
0a9c87e68332: Pull complete
6d909763dff3: Pull complete
7f01a1b77738: Pull complete
c70caad572e6: Pull complete
c0b57c72d7c7: Pull complete
3b7c493bb8f8: Pull complete
70f21191d5fa: Pull complete
b72ef49a1648: Pull complete
1735193fce1a: Pull complete
6f0a31eb4fc9: Pull complete
5a83b81d8cfd: Pull complete
24c069e055bb: Pull complete
9c90284fcd0f: Pull complete
405c3b74edb7: Pull complete
2c2cfec47605: Pull complete
f9e5bf6b037e: Pull complete
69b1183a0dc9: Pull complete
73133bf37ddc: Pull complete
187e35d56f89: Pull complete
23ec4ade6dcd: Pull complete
4fba3dd7f97c: Pull complete
11923c954056: Pull complete
95b67db4aa6d: Pull complete
73d16c81d9c9: Pull complete
f0a024c8b08f: Pull complete
099d0dd31169: Pull complete
96d82345047b: Pull complete
188b63e153b6: Pull complete
de97abb09153: Pull complete
01be5700f44b: Pull complete
9f01a696bb8b: Pull complete
3e3d4a57ff34: Pull complete
ccb0ee9eb079: Pull complete
a496569779c9: Pull complete
e9c89d74ffd4: Pull complete
e225aafaa730: Pull complete
ef7b62a5bf12: Pull complete
43449b45a07c: Pull complete
09fcbe8e254c: Pull complete
b85ef4e24a81: Pull complete
Digest: sha256:eeb55e2463291d83b8ad3d05f63a8be641dd1684daa13cafc2ea044546130fd5
Status: Downloaded newer image for nvcr.io/nvidia/merlin/merlin-inference:22.04

==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 22.03 (build 33743047)

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.6 driver version 510.47.03 with kernel driver version 460.73.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0509 02:01:17.740583 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f7998000000' with size 268435456
I0509 02:01:17.750397 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0509 02:01:17.750411 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0509 02:01:17.750417 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0509 02:01:17.750422 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
I0509 02:01:17.750428 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 4 with size 67108864
I0509 02:01:17.750434 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 5 with size 67108864
I0509 02:01:17.750439 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 6 with size 67108864
I0509 02:01:17.750446 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 7 with size 67108864
I0509 02:01:18.869713 1 model_repository_manager.cc:997] loading: meb:1
I0509 02:01:18.999772 1 hugectr.cc:1597] TRITONBACKEND_Initialize: hugectr
I0509 02:01:18.999796 1 hugectr.cc:1604] Triton TRITONBACKEND API version: 1.8
I0509 02:01:18.999803 1 hugectr.cc:1608] 'hugectr' TRITONBACKEND API version: 1.8
I0509 02:01:18.999810 1 hugectr.cc:1631] The HugeCTR backend Repository location: /usr/local/hugectr/backends/hugectr
I0509 02:01:18.999816 1 hugectr.cc:1640] The HugeCTR backend configuration: {"cmdline":{"ps":"/models/meb/ps.json"}}
I0509 02:01:18.999842 1 hugectr.cc:344] *****Parsing Parameter Server Configuration from /models/meb/ps.json
I0509 02:01:18.999897 1 hugectr.cc:365] Support 64-bit keys = 1
I0509 02:01:18.999912 1 hugectr.cc:376] Volatile database -> type = parallel_hash_map
I0509 02:01:18.999918 1 hugectr.cc:381] Volatile database -> address = 127.0.0.1:7000
I0509 02:01:18.999924 1 hugectr.cc:386] Volatile database -> user name = default
I0509 02:01:18.999932 1 hugectr.cc:390] Volatile database -> password = <empty>
I0509 02:01:18.999938 1 hugectr.cc:397] Volatile database -> algorithm = phm
I0509 02:01:18.999945 1 hugectr.cc:402] Volatile database -> number of partitions = 16
I0509 02:01:18.999951 1 hugectr.cc:408] Volatile database -> max. batch size (GET) = 100000
I0509 02:01:18.999958 1 hugectr.cc:415] Volatile database -> max. batch size (SET) = 100000
I0509 02:01:18.999964 1 hugectr.cc:423] Volatile database -> refresh time after fetch = 0
I0509 02:01:18.999972 1 hugectr.cc:430] Volatile database -> overflow margin = 120000000
I0509 02:01:18.999979 1 hugectr.cc:436] Volatile database -> overflow policy = evict_oldest
I0509 02:01:18.999999 1 hugectr.cc:442] Volatile database -> overflow resolution target = 0.8
I0509 02:01:19.000007 1 hugectr.cc:450] Volatile database -> initial cache rate = 1
I0509 02:01:19.000014 1 hugectr.cc:456] Volatile database -> cache missed embeddings = 0
I0509 02:01:19.000021 1 hugectr.cc:466] Volatile database -> update filters = []
I0509 02:01:19.000060 1 hugectr.cc:583] Model name = meb
I0509 02:01:19.000067 1 hugectr.cc:592] Model 'meb' -> network file = /models/meb/1/meb.json
I0509 02:01:19.000074 1 hugectr.cc:599] Model 'meb' -> max. batch size = 100
I0509 02:01:19.000081 1 hugectr.cc:605] Model 'meb' -> dense model file = /models/meb/1/meb_dense_0.model
I0509 02:01:19.000089 1 hugectr.cc:611] Model 'meb' -> sparse model files = [/models/meb/1/meb0_sparse_0.model]
I0509 02:01:19.000097 1 hugectr.cc:622] Model 'meb' -> use GPU embedding cache = 1                                                                                                    [406/538]
I0509 02:01:19.000106 1 hugectr.cc:631] Model 'meb' -> hit rate threshold = 0.8
I0509 02:01:19.000115 1 hugectr.cc:639] Model 'meb' -> per model GPU cache = 1
I0509 02:01:19.000132 1 hugectr.cc:655] Model 'meb' -> num. pool worker buffers = 4
I0509 02:01:19.000142 1 hugectr.cc:662] Model 'meb' -> num. pool refresh buffers = 1
I0509 02:01:19.000149 1 hugectr.cc:669] Model 'meb' -> cache refresh rate per iteration = 0.2
I0509 02:01:19.000158 1 hugectr.cc:678] Model 'meb' -> deployed device list = [0, 1, 2, 3, 4, 5, 6, 7]
I0509 02:01:19.000167 1 hugectr.cc:686] Model 'meb' -> default value for each table = [0]
I0509 02:01:19.000179 1 hugectr.cc:706] *****The HugeCTR Backend Parameter Server is creating... *****
I0509 02:01:19.000351 1 hugectr.cc:714] ***** Parameter Server(Int64) is creating... *****
I0509 02:03:04.076790 1 hugectr.cc:725] *****The HugeCTR Backend Backend created the Parameter Server successfully! *****
I0509 02:03:04.076884 1 hugectr.cc:1703] TRITONBACKEND_ModelInitialize: meb (version 1)
I0509 02:03:04.076891 1 hugectr.cc:1716] Repository location: /models/meb
I0509 02:03:04.076899 1 hugectr.cc:1731] backend configuration in mode: {"cmdline":{"ps":"/models/meb/ps.json"}}
I0509 02:03:04.078340 1 hugectr.cc:974] Verifying model configuration: {
    "name": "meb",
    "platform": "",
    "backend": "hugectr",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 100,
    "input": [
        {
            "name": "DES",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "CATCOLUMN",
            "data_type": "TYPE_INT64",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "ROWINDEX",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT0",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "meb_0",
            "kind": "KIND_GPU",
            "count": 4,
            "gpus": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "max_nnz": {
            "string_value": "8000"
        },
        "embedding_vector_size": {
            "string_value": "15"
        },
        "gpucacheper": {
            "string_value": "1.0"
        },
        "des_feature_num": {
            "string_value": "0"
        },
        "hit_rate_threshold": {
            "string_value": "0.8"
        },
        "gpucache": {
            "string_value": "true"
        },
        "embeddingkey_long_type": {
            "string_value": "true"
        },
        "slots": {
            "string_value": "15"
        },
        "config": {
            "string_value": "/models/meb/1/meb.json"
        },
        "cat_feature_num": {
            "string_value": "8000"
        },
        "label_dim": {
            "string_value": "1"
        }
    },
    "model_warmup": []
}
I0509 02:03:04.078466 1 hugectr.cc:1060] The model configuration: {
    "name": "meb",
    "platform": "",
    "backend": "hugectr",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 100,
    "input": [
        {
            "name": "DES",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "CATCOLUMN",
            "data_type": "TYPE_INT64",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "ROWINDEX",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT0",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "meb_0",
            "kind": "KIND_GPU",
            "count": 4,
            "gpus": [
                0,
                1,
                2,
                3,
                4,
                5,
                6,
                7
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "max_nnz": {
            "string_value": "8000"
        },
        "embedding_vector_size": {
            "string_value": "15"
        },
        "gpucacheper": {
            "string_value": "1.0"
        },
        "des_feature_num": {
            "string_value": "0"
        },
        "hit_rate_threshold": {
            "string_value": "0.8"
        },
        "gpucache": {
            "string_value": "true"
        },
        "embeddingkey_long_type": {
            "string_value": "true"
        },
        "slots": {
            "string_value": "15"
        },
        "config": {
            "string_value": "/models/meb/1/meb.json"
        },
        "cat_feature_num": {
            "string_value": "8000"
        },
        "label_dim": {
            "string_value": "1"
        }
    },
    "model_warmup": []
}
I0509 02:03:04.078548 1 hugectr.cc:1105] slots set = 15
I0509 02:03:04.078555 1 hugectr.cc:1111] desene number = 0
I0509 02:03:04.078562 1 hugectr.cc:1117] cat_feature number = 8000
I0509 02:03:04.078569 1 hugectr.cc:1129] embedding size = 15
I0509 02:03:04.078576 1 hugectr.cc:1135] maxnnz = 8000
I0509 02:03:04.078583 1 hugectr.cc:1153] HugeCTR model config path = /models/meb/1/meb.json
I0509 02:03:04.078593 1 hugectr.cc:1176] support gpu cache = 1
I0509 02:03:04.078610 1 hugectr.cc:1199] gpu cache per = 1
I0509 02:03:04.078619 1 hugectr.cc:1216] hit-rate threshold = 0.8
I0509 02:03:04.078626 1 hugectr.cc:1232] Label dim = 1
I0509 02:03:04.078633 1 hugectr.cc:1238] support 64-bit embedding key = 1
I0509 02:03:04.078639 1 hugectr.cc:1252] Model_Inference_Para.max_batchsize: 100
I0509 02:03:04.078645 1 hugectr.cc:1256] max_batch_size in model config.pbtxt is 100
I0509 02:03:04.078654 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 0
I0509 02:03:04.078661 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 1
I0509 02:03:04.078667 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 2
I0509 02:03:04.078673 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 3
I0509 02:03:04.078680 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 4
I0509 02:03:04.078686 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 5
I0509 02:03:04.078693 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 6
I0509 02:03:04.078699 1 hugectr.cc:1326] ******Creating Embedding Cache for model meb in device 7
I0509 02:03:04.078705 1 hugectr.cc:1353] ******Creating Embedding Cache for model meb successfully
I0509 02:03:04.084639 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 0)
I0509 02:03:04.084655 1 hugectr.cc:1495] Triton Model Instance Initialization on device 0
I0509 02:03:04.092570 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:04.092584 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:04.119280 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:04.119445 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:04.119527 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:04.119535 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:06.406007 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:06.406216 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 1)
I0509 02:03:06.406245 1 hugectr.cc:1495] Triton Model Instance Initialization on device 1
I0509 02:03:06.406305 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:06.406313 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:06.423536 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:06.423681 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:06.423781 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:06.423789 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:07.636730 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:07.636912 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 2)
I0509 02:03:07.636927 1 hugectr.cc:1495] Triton Model Instance Initialization on device 2
I0509 02:03:07.636933 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:07.636941 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:07.653589 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:07.653740 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:07.653842 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:07.653850 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:08.864003 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:08.864193 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 3)
I0509 02:03:08.864208 1 hugectr.cc:1495] Triton Model Instance Initialization on device 3
I0509 02:03:08.864215 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:08.864239 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:08.881026 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:08.881182 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:08.881300 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:08.881309 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:10.089367 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:10.089537 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 4)
I0509 02:03:10.089549 1 hugectr.cc:1495] Triton Model Instance Initialization on device 4
I0509 02:03:10.089556 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:10.089563 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:10.106748 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:10.106901 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:10.107012 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:10.107020 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:11.315555 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:11.315748 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 5)
I0509 02:03:11.315763 1 hugectr.cc:1495] Triton Model Instance Initialization on device 5
I0509 02:03:11.315770 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:11.315776 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:11.333771 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:11.333941 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:11.334056 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:11.334065 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:12.548906 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:12.549089 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 6)
I0509 02:03:12.549103 1 hugectr.cc:1495] Triton Model Instance Initialization on device 6
I0509 02:03:12.549110 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:12.549116 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:12.566637 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:12.566807 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:12.566930 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:12.566939 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:13.789554 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:13.789746 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_0 (device 7)
I0509 02:03:13.789761 1 hugectr.cc:1495] Triton Model Instance Initialization on device 7
I0509 02:03:13.789768 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:13.789775 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:13.807076 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:13.807262 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:13.807390 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:13.807399 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.054373 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.054532 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 0)
I0509 02:03:15.054545 1 hugectr.cc:1495] Triton Model Instance Initialization on device 0
I0509 02:03:15.054551 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.054558 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.059515 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.059646 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.059775 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.059784 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.435527 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.435710 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 1)
I0509 02:03:15.435724 1 hugectr.cc:1495] Triton Model Instance Initialization on device 1
I0509 02:03:15.435733 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.435742 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.440908 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.441018 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.441116 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.441124 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:15.809277 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:15.809436 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 2)
I0509 02:03:15.809449 1 hugectr.cc:1495] Triton Model Instance Initialization on device 2
I0509 02:03:15.809455 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:15.809462 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:15.814312 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:15.814412 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:15.814511 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:15.814519 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.182679 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.182843 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 3)
I0509 02:03:16.182857 1 hugectr.cc:1495] Triton Model Instance Initialization on device 3
I0509 02:03:16.182863 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.182870 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.188940 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.189040 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.189140 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.189148 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.567086 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.567323 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 4)
I0509 02:03:16.567338 1 hugectr.cc:1495] Triton Model Instance Initialization on device 4
I0509 02:03:16.567344 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.567350 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.572950 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.573061 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.573169 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.573177 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:16.954819 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:16.954976 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 5)
I0509 02:03:16.954989 1 hugectr.cc:1495] Triton Model Instance Initialization on device 5
I0509 02:03:16.954996 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:16.955001 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:16.960033 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:16.960144 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:16.960262 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:16.960270 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:17.335444 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:17.335611 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 6)
I0509 02:03:17.335624 1 hugectr.cc:1495] Triton Model Instance Initialization on device 6
I0509 02:03:17.335631 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:17.335637 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:17.340769 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:17.340877 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:17.340976 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:17.340983 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
I0509 02:03:17.725151 1 hugectr.cc:1565] ******Loading HugeCTR model successfully
I0509 02:03:17.725387 1 hugectr.cc:1851] TRITONBACKEND_ModelInstanceInitialize: meb_0_1 (device 7)
I0509 02:03:17.725400 1 hugectr.cc:1495] Triton Model Instance Initialization on device 7
I0509 02:03:17.725407 1 hugectr.cc:1505] Dense Feature buffer allocation:
I0509 02:03:17.725412 1 hugectr.cc:1512] Categorical Feature buffer allocation:
I0509 02:03:17.730393 1 hugectr.cc:1530] Categorical Row Index buffer allocation:
I0509 02:03:17.730501 1 hugectr.cc:1540] Predict result buffer allocation:
I0509 02:03:17.730597 1 hugectr.cc:1864] ******Loading HugeCTR Model******
I0509 02:03:17.730606 1 hugectr.cc:1558] The model origin json configuration file path is: /models/meb/1/meb.json
[qgpuvb813-cvision:1    :0:381] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:    381) ====
 0 0x00000000000143c0 __funlockfile()  ???:0
 1 0x000000000018ba51 __nss_database_lookup()  ???:0
 2 0x0000000000068d6c ncclGroupEnd()  ???:0
 3 0x000000000005de2d ncclGroupEnd()  ???:0
 4 0x0000000000008609 start_thread()  ???:0
 5 0x000000000011f163 clone()  ???:0
=================================

Inconsistency between HugeCTR v3.3.1 and hugectr_backend v3.3.1

Hi Folks,

I just tried building the latest tags from scratch and hugectr_backend fails with three issues:

  1. Use of <filesystem> requires C++17 (cxx_std_17).
  2. It looks like hugectr_backend anticipates the need for {db_type, redis_ip, rocksdb_path, cache_size_percentage_redis} to be kept in HugeCTR::InferenceParams, but they are not used anywhere at present; either add the members or hold off on the assignments until they are needed.
  3. Warnings are treated as fatal errors, and the variable float cache_size_percentage_redis = 0.1 is assigned but never used.

The following patch works around all three issues:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 215b520..479dd2d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -99,7 +99,7 @@ target_include_directories(
     ${CMAKE_CURRENT_SOURCE_DIR}/src
 )

-target_compile_features(triton-hugectr-backend PRIVATE cxx_std_11)
+target_compile_features(triton-hugectr-backend PRIVATE cxx_std_17)
 target_compile_options(
   triton-hugectr-backend PRIVATE
   $<$<OR:$<CXX_COMPILER_ID:Clang>,$<CXX_COMPILER_ID:AppleClang>,$<CXX_COMPILER_ID:GNU>>:
diff --git a/src/hugectr.cc b/src/hugectr.cc
index 2979c85..629259a 100644
--- a/src/hugectr.cc
+++ b/src/hugectr.cc
@@ -298,7 +298,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
   LOG_MESSAGE(TRITONSERVER_LOG_INFO,(std::string("The depolyment Data base type is: ") + db_type).c_str());

   float cache_size_percentage_redis=0.1;
-  std::string cpu_cache_per;
+  std::string cpu_cache_per = std::to_string(cache_size_percentage_redis);
   parameter_server_config.MemberAsString("cache_size_percentage_redis", &cpu_cache_per);
   cache_size_percentage_redis=std::atof(cpu_cache_per.c_str());
   LOG_MESSAGE(TRITONSERVER_LOG_INFO,(std::string("The depolyment cache_size_percentage_redis is: ") + cpu_cache_per).c_str());
@@ -340,6 +340,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
     }

     HugeCTR::InferenceParams infer_param(modelname, 64, 0.55, dense, sparses, 0, true, 0.55, support_int64_key_);
+    /*
     if(db_type== "local"){
       infer_param.db_type=HugeCTR::DATABASE_TYPE::LOCAL;
     }
@@ -353,6 +354,7 @@ HugeCTRBackend::ParseParameterServer(const std::string& path){
     infer_param.redis_ip =redis_ip;
     infer_param.rocksdb_path = rocksdb_path;
     infer_param.cache_size_percentage_redis = cache_size_percentage_redis;
+     */
     inference_params_map.insert(std::pair<std::string, HugeCTR::InferenceParams>(modelname, infer_param));
   }
   return nullptr;

Does the max_batch_size configuration not work for batched inference?

I changed "max_batch_size" in ps.json and /model/wdl/config.pbtxt, trying values of 64, 256, 1024, and 4096, and ran a load test with 100 HTTP clients. Checking the logs, the inference batch size did not change, and the response time was also similar across the different max_batch_size settings.

Is the max_batch_size setting honored by hugectr_backend? Any suggestions about the value and function of max_batch_size?
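For reference, max_batch_size is set in two places for a HugeCTR model: per model in ps.json and at the top level of the model's config.pbtxt. Below is a minimal sketch of both, using the wdl model from this report and 1024 purely as an illustrative value (all other fields omitted):

# Excerpt of /model/wdl/config.pbtxt (Triton protobuf text format)
name: "wdl"
backend: "hugectr"
max_batch_size: 1024

# Excerpt of the matching model entry in /model/ps.json
{
    "models": [
        {
            "model": "wdl",
            "max_batch_size": 1024
        }
    ]
}

The backend logs the effective value at startup (see the "max_batch_size in model config.pbtxt is 100" line earlier on this page), so it is worth checking that this line reflects the new value. Also note that the observed batch size depends on what the clients actually send: 100 independent HTTP clients each issuing single requests may never form larger batches on their own unless some form of server-side batching takes effect, which could explain the similar response times.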

How to update the model weights on the Triton server after continuous training?

Hello. We are testing continuous training.

We followed this example: https://github.com/triton-inference-server/hugectr_backend/blob/v3.5/samples/hierarchical_deployment/hps_e2e_demo/Continuous_Training.ipynb

We tested this sample with our own simple dataset. First we ran training with Kafka and configured the Kafka settings in the Triton ps.json; that part works well.

  • HugeCTR training log
    (screenshot omitted)

  • Triton log
    (screenshot omitted)

But we cannot find out how to update the dense model weights on the Triton server after the embeddings have been updated.

We found another approach: the load API. But the load API reloads the model weights and the embeddings together; in that case we would not need Kafka at all and could simply update the model and the embeddings directly. However, we cannot find an example of how to use the load API.

(screenshot omitted)

So we see two scenarios for updating embeddings on Triton:

  1. Continuous training with Kafka and Triton, which only updates the embeddings. (We want to know how to update the dense model weights after that, or whether freeze_dense is the answer.)

  2. Continuous training without Kafka, using the Triton load API to reload the dense model weights and update the embeddings (see the sketch below).
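For scenario 2, here is a minimal sketch of driving the Triton model-control load API from Python. It assumes the server was started with --model-control-mode=explicit, that the updated model files have already been copied into the model repository, and it uses "wdl" purely as a placeholder model name; none of this is confirmed by the thread.

import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (assumed to be localhost:8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Ask Triton to (re)load the model after the updated files are in place.
# This requires the server to be running with --model-control-mode=explicit.
client.load_model(model_name="wdl")

# Optionally confirm that the reloaded model is ready to serve again.
assert client.is_model_ready(model_name="wdl")

Whether such a reload lets the HPS backend pick up only the dense weights, or always re-reads the sparse model files as well, is exactly the open question in this issue.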

HugeCTR memory pool is empty

Hello. I tested the HugeCTR backend again, this time using perf_analyzer.

I am using merlin-inference:22.03 on a V100. The HugeCTR backend uses 28 GB of GPU memory, so about 4 GB of GPU memory remains.
(screenshot omitted)

When I test with perf_analyzer at concurrency 17, throughput drops rapidly and the HugeCTR log reports "memory pool is empty". How can I fix this?
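For reference, a sweep like the one below can be produced with a perf_analyzer invocation along these lines (a sketch; the model name meb is taken from the logs above, while the server URL and the input-data file are assumptions):

# Sweep client concurrency from 1 to 29 in steps of 4 against the deployed model.
# --input-data points to a JSON file with representative DES/CATCOLUMN/ROWINDEX inputs.
perf_analyzer -m meb -u localhost:8000 \
    --concurrency-range 1:29:4 \
    --input-data perf_data.json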

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1906.57 infer/sec, latency 520 usec
Concurrency: 5, throughput: 7677.62 infer/sec, latency 647 usec
Concurrency: 9, throughput: 11505.2 infer/sec, latency 778 usec
Concurrency: 13, throughput: 12422.1 infer/sec, latency 1042 usec
Concurrency: 17, throughput: 3247.32 infer/sec, latency 5232 usec
Concurrency: 21, throughput: 764.767 infer/sec, latency 27540 usec
Concurrency: 25, throughput: 240.7 infer/sec, latency 103849 usec
Concurrency: 29, throughput: 217.05 infer/sec, latency 133800 usec
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][INFO][RK0][EC insert #10]: *****Insert embedding cache of model meb on device 4*****
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139898459779072]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty
[HCTR][02:36:36][WARNING][RK0][tid #139916302344192]: memory pool is empty

Also, I want to change the HugeCTR log level. I already pass the tritonserver options --log-verbose=0 and --log-info=false, but HugeCTR still prints info-level logs. How can I change the HugeCTR log option?

[BUG] Loading the criteo model fails in Triton inference

Describe the bug
Loading the criteo model fails in Triton inference.

Steps/Code to reproduce bug

  1. tritonserver --model-repository=/model/ --backend-config=hugectr,ps=/model/ps.json --model-control-mode=explicit
  2. triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)
  3. triton_client.load_model(model_name="criteo") (a self-contained version of steps 2 and 3 follows below)
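Steps 2 and 3 as a single self-contained snippet (a sketch; it assumes the tritonclient Python package is installed and imports it under the alias used above):

import tritonclient.http as tritonhttpclient

# Connect to the Triton HTTP endpoint started in step 1.
triton_client = tritonhttpclient.InferenceServerClient(url="localhost:8000", verbose=True)

# Explicitly load the model; the server is running with --model-control-mode=explicit.
triton_client.load_model(model_name="criteo")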

Expected behavior
The criteo model loads successfully.

Environment details (please complete the following information):

  • Environment location: Docker
  • Method of NVTabular install: Docker
    • docker pull nvcr.io/nvidia/merlin/merlin-inference:21.11
    • docker run -itd --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /data:/data nvcr.io/nvidia/merlin/merlin-inference:21.11
    • docker exec -it xxx /bin/bash
    • tritonserver --model-repository=/model/ --backend-config=hugectr,ps=/model/ps.json --model-control-mode=explicit

Additional context
I1123 10:31:24.977723 421 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I1123 10:31:24.977975 421 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I1123 10:31:25.019065 421 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
I1123 10:31:53.109162 421 model_repository_manager.cc:1022] loading: criteo:1
I1123 10:31:53.227686 421 hugectr.cc:1140] TRITONBACKEND_Initialize: hugectr
I1123 10:31:53.227720 421 hugectr.cc:1150] Triton TRITONBACKEND API version: 1.6
I1123 10:31:53.227730 421 hugectr.cc:1156] 'hugectr' TRITONBACKEND API version: 1.6
I1123 10:31:53.227735 421 hugectr.cc:1181] The HugeCTR backend Repository location: /opt/tritonserver/backends/hugectr
I1123 10:31:53.227740 421 hugectr.cc:1191] The HugeCTR backend configuration:
{"cmdline":{"ps":"/model/ps.json"}}
I1123 10:31:53.227778 421 hugectr.cc:311] *****Parsing Parameter Server Configuration from /model/ps.json
I1123 10:31:53.227801 421 hugectr.cc:325] Enable support for Int64 embedding key: 0
I1123 10:31:53.227811 421 hugectr.cc:329] The depolyment Data base type is: local
I1123 10:31:53.227820 421 hugectr.cc:335] The depolyment cache_size_percentage_redis is:
I1123 10:31:53.227830 421 hugectr.cc:339] Redis ip is: 127.0.0.1:7000
I1123 10:31:53.227837 421 hugectr.cc:343] Local RocksDB path is:
I1123 10:31:53.227848 421 hugectr.cc:433] The HugeCTR Backend Parameter Server is creating...
I1123 10:31:53.227857 421 hugectr.cc:446] The HugeCTR Backend Backend Parameter Server(Int32) is creating...
Signal (11) received.
0# 0x00005622F47A58A9 in tritonserver
1# 0x00007F02F22BD210 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# HugeCTR::parameter_server::parameter_server(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&, std::vector<HugeCTR::InferenceParams, std::allocatorHugeCTR::InferenceParams >&) in /usr/local/hugectr/lib/libhugectr_inference.so
3# HugeCTR::HugectrUtility::Create_Parameter_Server(HugeCTR::INFER_TYPE, std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&, std::vector<HugeCTR::InferenceParams, std::allocatorHugeCTR::InferenceParams >&) in /usr/local/hugectr/lib/libhugectr_inference.so
4# 0x00007F02D1643F52 in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
5# TRITONBACKEND_Initialize in /opt/tritonserver/backends/hugectr/libtriton_hugectr.so
6# 0x00007F02F2E34F7B in /opt/tritonserver/bin/../lib/libtritonserver.so
7# 0x00007F02F2E369FB in /opt/tritonserver/bin/../lib/libtritonserver.so
8# 0x00007F02F2E3F000 in /opt/tritonserver/bin/../lib/libtritonserver.so
9# 0x00007F02F2CE59BA in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x00007F02F2CF37B1 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# 0x00007F02F26ABDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F02F2B29609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Segmentation fault (core dumped)

How to turn off only the request log? (The request log is too verbose, resulting in poor throughput)

The more detailed the server startup log, the better it prevents mistakes and helps debugging.
However, when the per-request log becomes too verbose, server throughput suffers.
In fact, turning on the info log alone cuts server throughput in half for the same 40-thread request load.

How can I make the server quiet by turning off only the request log while leaving the info log on at startup?
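For context, the two runs compared below differ only in the logging flags passed to tritonserver. A sketch of the corresponding launch commands, with the model repository path and backend configuration assumed from earlier examples on this page:

# Quiet run: info logging disabled.
tritonserver --model-repository=/models --backend-config=hugectr,ps=/models/ps.json \
    --log-info=false --log-verbose=0

# Verbose run: info logging enabled, which (as observed here) also emits per-request log lines.
tritonserver --model-repository=/models --backend-config=hugectr,ps=/models/ps.json \
    --log-info=true --log-verbose=0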

Below is the client-side benchmark result for the same request.

--log-info=false

This leaves no lines in the server log.

[Q0000041000][T12] 2282.60QPS avg.etl/infer/total:  3.87ms  10.00ms  15.20ms top.etl/infer/total:  9.00ms  15.00ms  22.00ms
[Q0000042000][T10] 2281.58QPS avg.etl/infer/total:  3.69ms  11.43ms  16.29ms top.etl/infer/total:  8.00ms  17.00ms  25.00ms
[Q0000043000][T11] 2282.70QPS avg.etl/infer/total:  4.13ms  11.52ms  16.87ms top.etl/infer/total: 10.00ms  21.00ms  28.00ms
[Q0000044000][T14] 2281.04QPS avg.etl/infer/total:  4.10ms  10.98ms  16.33ms top.etl/infer/total:  9.00ms  18.00ms  26.00ms
[Q0000045000][T33] 2280.70QPS avg.etl/infer/total:  3.43ms  11.81ms  16.51ms top.etl/infer/total:  8.00ms  19.00ms  24.00ms
[Q0000046000][T03] 2279.96QPS avg.etl/infer/total:  4.07ms   9.99ms  15.36ms top.etl/infer/total: 10.00ms  16.00ms  23.00ms

--log-info=true

This option makes the server request log too verbose, cutting throughput in half.

[Q0000041000][T11] 1108.63QPS avg.etl/infer/total:  3.89ms  37.53ms  42.71ms top.etl/infer/total:  9.00ms  69.00ms  77.00ms
[Q0000042000][T26] 1108.04QPS avg.etl/infer/total:  4.26ms  22.60ms  28.14ms top.etl/infer/total: 10.00ms  33.00ms  40.00ms
[Q0000043000][T37] 1107.30QPS avg.etl/infer/total:  3.88ms  23.35ms  28.63ms top.etl/infer/total: 10.00ms  39.00ms  45.00ms
[Q0000044000][T19] 1109.48QPS avg.etl/infer/total:  4.11ms  23.69ms  29.10ms top.etl/infer/total: 10.00ms  35.00ms  42.00ms
[Q0000045000][T17] 1107.12QPS avg.etl/infer/total:  3.81ms  41.57ms  46.75ms top.etl/infer/total:  9.00ms  75.00ms  81.00ms
[Q0000046000][T27] 1108.08QPS avg.etl/infer/total:  4.35ms  24.15ms  29.93ms top.etl/infer/total: 14.00ms  32.00ms  43.00ms

HugeCTR model on a SageMaker endpoint with the Hierarchical Parameter Server

Hi all,

I'm looking to train and deploy a HugeCTR model. The model is shallow and mostly based on user-item embeddings; I have about 180K users and 8K items.
I'm not interested in updating these embeddings, so I don't need the Kafka component, but I am looking for the ability to swap embeddings between the GPU cache and the CPU cache to reduce latency. Is this part of the Hierarchical Parameter Server, or is it natively supported by any HugeCTR model?

How to put the entire embedding table into the GPU embedding cache?

  • image: merlin-inference:22.03
  • machine: V100
  • request: 10 million requests (replayed repeatedly)

We tested the HugeCTR backend by replaying the 10 million requests repeatedly (epoch 1, epoch 2, ... up to epoch 50). CPU usage keeps dropping across epochs, presumably because of the GPU embedding cache: as more embeddings are inserted into the GPU cache, CPU usage falls and RPS improves.
(screenshot omitted)

So we built a smaller embedding table that fits in the GPU cache. The V100 has 32 GB of GPU memory, so we made the embedding table about 19 GB. Since this fits entirely in the GPU cache, we expected a large CPU usage and RPS improvement already in the first epoch, but the result is the same.

In the first epoch, CPU usage is still high, around 70-80%. If the GPU embedding cache loaded the whole table at initialization time, CPU usage should be very low.

This is our ps.json. How can we load the entire embedding table into the GPU embedding cache? We already set gpucacheper to 1.0, but it does not help.

{
    "supportlonglong":"true",
    "volatile_db": {
            "type":"parallel_hash_map",
            "initial_cache_rate":1.0,
            "max_get_batch_size": 100000,
            "max_set_batch_size": 100000
    },
    "models":[
        {
            "model":"meb",
            "supportlonglong":true,
            "num_of_worker_buffer_in_pool":"8",
            "num_of_refresher_buffer_in_pool":"1",
            "deployed_device_list":[0, 1, 2, 3, 4, 5, 6, 7],
            "max_batch_size":100,
            "default_value_for_each_table":[0.0],
            "hit_rate_threshold":"1.1",
            "gpucacheper":"1.0",
            "gpucache":"true",
            "cache_refresh_percentage_per_iteration":0.0,
            "sparse_files":["/infer/models/video-meb-bmtest-v3/1/meb0_sparse_0.model"],
            "dense_file":"/infer/models/video-meb-bmtest-v3/1/meb_dense_0.model",
            "network_file":"/infer/models/video-meb-bmtest-v3/1/meb.json"
        }
    ]
}
