Comments (7)
Hi, please verify your GPU is actually running this by looking at nvidia-smi. Also try Triton's own performance analysis tool, which is bundled with the SDK and packaged in the Docker images, and test your endpoint with it to make sure the slowdown isn't caused by your client-side code.
from yolov4-triton-tensorrt.
Hi
I checked nvidia-smi and it does indeed load the model into GPU memory. However, memory usage does not actually increase during inference, which I assume it should. So maybe on an inference request it falls back to the CPU, which would explain why it takes so much longer.
When I query the model configuration, the following output is produced:
{
  "name": "yolov4",
  "platform": "tensorrt_plan",
  "backend": "tensorrt",
  "version_policy": {
    "latest": {
      "num_versions": 1
    }
  },
  "max_batch_size": 0,
  "input": [
    {
      "name": "input",
      "data_type": "TYPE_FP32",
      "format": "FORMAT_NONE",
      "dims": [1, 3, 416, 416],
      "is_shape_tensor": false,
      "allow_ragged_batch": false,
      "optional": false
    }
  ],
  "output": [
    {
      "name": "boxes",
      "data_type": "TYPE_FP32",
      "dims": [1, 10647, 1, 4],
      "label_filename": "",
      "is_shape_tensor": false
    },
    {
      "name": "confs",
      "data_type": "TYPE_FP32",
      "dims": [1, 10647, 18],
      "label_filename": "",
      "is_shape_tensor": false
    }
  ],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
      "enable": true
    },
    "output_pinned_memory": {
      "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "dynamic_batching": {
    "preferred_batch_size": [],
    "max_queue_delay_microseconds": 0,
    "preserve_ordering": false,
    "priority_levels": 0,
    "default_priority_level": 0,
    "priority_queue_policy": {}
  },
  "instance_group": [
    {
      "name": "yolov4",
      "kind": "KIND_GPU",
      "count": 1,
      "gpus": [0],
      "secondary_devices": [],
      "profile": [],
      "passive": false,
      "host_policy": ""
    }
  ],
  "default_model_filename": "model.plan",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
Additionally, according to the logs from my initial post, the following occurs
I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests
So it seems the actual inference request only takes ~10-12 milliseconds, but the total round-trip time is still 3 seconds.
This is the client script I am using to send the inference requests:
import sys
import time

import cv2
import numpy as np
import tritonclient.grpc as grpcclient

# FLAGS, preprocess() and postprocess() are defined elsewhere in the script.

def main():
    triton_client = grpcclient.InferenceServerClient(
        url=FLAGS.url,
        verbose=FLAGS.verbose)

    inputs = []
    outputs = []
    inputs.append(grpcclient.InferInput('input', [1, 3, FLAGS.width, FLAGS.height], "FP32"))
    outputs.append(grpcclient.InferRequestedOutput('boxes'))

    print("Creating buffer from image file...")
    input_image = cv2.imread(str(FLAGS.input))
    if input_image is None:
        print(f"FAILED: could not load input image {str(FLAGS.input)}")
        sys.exit(1)

    input_image_buffer = preprocess(input_image, [FLAGS.width, FLAGS.height])
    input_image_buffer = np.expand_dims(input_image_buffer, axis=0)
    inputs[0].set_data_from_numpy(input_image_buffer)

    print("Invoking inference...")
    print(int(time.time()))
    results = triton_client.infer(model_name=FLAGS.model,
                                  inputs=inputs,
                                  outputs=outputs)
    result = results.as_numpy('boxes')
    # print(f"Received result buffer of size {result.shape}")
    # print(f"Naive buffer sum: {np.sum(result)}")
    detected_objects = postprocess(result, input_image.shape[1], input_image.shape[0],
                                   [FLAGS.width, FLAGS.height], FLAGS.confidence, FLAGS.nms)
    # print(f"Detected objects: {len(detected_objects)}")
    print(int(time.time()))
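One caveat about the timing in that script: `int(time.time())` truncates to whole seconds, so the printed difference can be off by almost a second in either direction, and it lumps channel setup together with the `infer()` call itself. A minimal sketch of higher-resolution timing (the `time_call` helper and the wrapped call are illustrative, not part of the original script):

```python
import time

def time_call(fn, *args, **kwargs):
    """Time a single call with sub-millisecond resolution.

    int(time.time()) only has whole-second resolution, so a ~3 s gap
    measured that way cannot show where the time actually goes.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Hypothetical usage around the Triton call:
# results, ms = time_call(triton_client.infer, model_name=FLAGS.model,
#                         inputs=inputs, outputs=outputs)
# print(f"infer() took {ms:.1f} ms")
```

Timing the `infer()` call in isolation, separately from client construction and preprocessing, would show whether the 3 seconds is spent in the request or around it.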
I ran the performance analysis tool and these are the results
According to perf_client everything is working as expected. I don't see where you are measuring the 3-second round trip.
Please check the very first post in this issue; I pasted the logs in it.
Below is the output from the client.py script I sent above. As you can see, it prints the current time (epoch seconds) right before it sends the request to the server and again after it has received the result. The difference is 3 seconds.
In the logs I sent in the first post, the very first line prints out
I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START
The very last line prints out
I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6
As you can see, it started at 05:09:58 and finished at 05:10:01 - a 3-second difference.
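That gap can be computed directly from the log timestamps; note that the measured span begins at the ServerLive health check, so it covers more than just the inference itself. A small sketch of parsing the glog-style timestamps Triton emits (`HH:MM:SS.ffffff` in the second field):

```python
from datetime import datetime

def glog_delta_seconds(first_line, last_line):
    """Elapsed seconds between two Triton (glog-style) log lines.

    The timestamp is the second whitespace-separated field,
    formatted HH:MM:SS.ffffff (no date component).
    """
    def ts(line):
        return datetime.strptime(line.split()[1], "%H:%M:%S.%f")
    return (ts(last_line) - ts(first_line)).total_seconds()

first = "I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START"
last = "I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6"
print(glog_delta_seconds(first, last))  # 3.224066
```

The TRITONBACKEND_ModelExecute lines in the same log span only ~12 ms, so most of the 3.2 s elapses between the ServerLive check and the actual inference request.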
I don't believe the performance client gives an accurate measurement, because I don't think it sends an actual image. The command I used to run the perf client is below:
perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4
The perf_client does send an actual image - you can even specify the image if you want. The problem is definitely in the client code. Please try the client.py from this repo and see if there is a difference. I'm unsure about the Triton logs and what the timings mean in more detail.
It seems it was an error in the client - what it was exactly, I am unsure.
However, another problem has come up - the detections from the model don't seem correct. The client seems to be looking for "detections", but Triton only allows "boxes" and "confs" as outputs now. I replaced "detections" in the client.py script with "boxes" and kept everything else the same, but when I did that, the detections were all messed up (see image below).
We tested this exact image with our model before Triton, using plain TensorRT, and it detected everything as expected, so it certainly cannot be the model.
No, the output of the newest version of our networks is called "detections" and consists of just one array; please check again.