
Comments (7)

philipp-schmidt commented on June 14, 2024

Hi, please verify that your GPU is actually running this by looking at nvidia-smi. Also try Triton's own performance analysis tool, which you can find bundled with their SDK or packaged in the Docker images, and test your endpoint with that tool to make sure the problem isn't in your client-side code.


dominusLabs commented on June 14, 2024

Hi

I checked nvidia-smi and it does indeed load the model into GPU memory. However, memory usage does not actually increase during inference, which I assume it should. So maybe it reverts to using the CPU for inference on each request, which would explain why it takes much longer.
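(For what it's worth, flat memory usage alone does not prove CPU fallback: a TensorRT plan allocates its activation memory when the engine loads, so GPU utilization during a burst of requests is the better signal. A minimal sketch, assuming GPU index 0 and the nvidia-ml-py / pynvml package:

import time
import pynvml

# Poll GPU utilization and memory while inference requests are in flight.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(20):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"gpu={util.gpu}% mem_used={mem.used / 1e6:.0f} MB")
    time.sleep(0.25)
pynvml.nvmlShutdown()

A non-zero gpu percentage while requests are running confirms the GPU is doing the work.)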

However, when running the model, the following output is produced

{
    "name": "yolov4",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "input",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1,
                3,
                416,
                416
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "boxes",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                1,
                4
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "confs",
            "data_type": "TYPE_FP32",
            "dims": [
                1,
                10647,
                18
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "dynamic_batching": {
        "preferred_batch_size": [],
        "max_queue_delay_microseconds": 0,
        "preserve_ordering": false,
        "priority_levels": 0,
        "default_priority_level": 0,
        "priority_queue_policy": {}
    },
    "instance_group": [
        {
            "name": "yolov4",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}
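(As an aside, this configuration can be pulled from a running server with the Python gRPC client; a minimal sketch, assuming the server address and model name used elsewhere in this thread:

import tritonclient.grpc as grpcclient

# Query model metadata (I/O names and shapes) and the full configuration.
client = grpcclient.InferenceServerClient(url="127.0.0.1:8001")
print(client.get_model_metadata(model_name="yolov4"))
print(client.get_model_config(model_name="yolov4")))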

Additionally, according to the logs from my initial post, the following occurs

I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests

So it seems the actual inference request only takes ~10-12 milliseconds, but the total round-trip time is still 3 seconds.

This is the client script I am using to send the inference requests:

import sys
import time

import cv2
import numpy as np
import tritonclient.grpc as grpcclient

# FLAGS (argument parsing) and preprocess()/postprocess() come from the
# repo's client code and are omitted here.

def main():
    triton_client = grpcclient.InferenceServerClient(
        url=FLAGS.url,
        verbose=FLAGS.verbose)

    inputs = []
    outputs = []
    # NCHW layout: [batch, channels, height, width] (width == height == 416 here)
    inputs.append(grpcclient.InferInput('input', [1, 3, FLAGS.width, FLAGS.height], "FP32"))
    outputs.append(grpcclient.InferRequestedOutput('boxes'))

    print("Creating buffer from image file...")
    input_image = cv2.imread(str(FLAGS.input))
    if input_image is None:
        print(f"FAILED: could not load input image {str(FLAGS.input)}")
        sys.exit(1)

    input_image_buffer = preprocess(input_image, [FLAGS.width, FLAGS.height])
    input_image_buffer = np.expand_dims(input_image_buffer, axis=0)

    inputs[0].set_data_from_numpy(input_image_buffer)

    print("Invoking inference...")

    print(int(time.time()))
    results = triton_client.infer(model_name=FLAGS.model,
                                  inputs=inputs,
                                  outputs=outputs)
    result = results.as_numpy('boxes')
    # print(f"Received result buffer of size {result.shape}")
    # print(f"Naive buffer sum: {np.sum(result)}")

    detected_objects = postprocess(result, input_image.shape[1], input_image.shape[0],
                                   [FLAGS.width, FLAGS.height], FLAGS.confidence, FLAGS.nms)
    # print(f"Detected objects: {len(detected_objects)}")

    print(int(time.time()))

I ran the performance analysis tool and these are the results
[Screenshot: perf_client results]


philipp-schmidt commented on June 14, 2024

According to the perf_client, everything is working as expected. I don't see where you measure a 3-second round trip.


dominusLabs commented on June 14, 2024

Please check the very first post in this issue; I pasted the logs in it.

Below is the output from the client.py script I sent above. As you can see, it prints the current epoch time right before it sends the inference request to the server, and again after it has received the result. The time difference is 3 seconds.
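(Two caveats about measuring this way: int(time.time()) only has one-second resolution, and the first request pays one-off costs such as channel setup and first-use allocations. A more telling measurement, sketched here under the same client setup as above, warms up once and then times repeated requests with time.perf_counter():

# Sketch: warm up once, then time repeated requests. Assumes triton_client,
# inputs and outputs are set up as in the script above.
import time

triton_client.infer(model_name=FLAGS.model, inputs=inputs, outputs=outputs)  # warm-up

n = 20
start = time.perf_counter()
for _ in range(n):
    triton_client.infer(model_name=FLAGS.model, inputs=inputs, outputs=outputs)
elapsed = time.perf_counter() - start
print(f"mean round trip: {1000 * elapsed / n:.1f} ms"))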

In the logs I sent in the first post, the very first line prints out

I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START

The very last line prints out

I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6

As you can see, it started at 05:09:58 and finished at 05:10:01 - a 3-second difference.

I don't believe the performance client is an accurate measurement, because I don't think it is sending an actual image. The command I used to run the perf client is below:

perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4

[Screenshot: perf_client output]
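(A side note on the flag used above: --shared-memory system makes perf_client pass input buffers through system shared memory rather than serializing them into the gRPC request body, so it exercises a different transport path than the plain client script. The Python client can use the same mechanism; a rough sketch with an illustrative region name, reusing triton_client, inputs and input_image_buffer from the script above:

import tritonclient.utils.shared_memory as shm

# Create and register a system shared-memory region for the input tensor.
# 1 * 3 * 416 * 416 FP32 values = 2076672 bytes, matching the pinned
# memory allocation size in the server logs.
byte_size = input_image_buffer.nbytes
shm_handle = shm.create_shared_memory_region("input_data", "/input_data", byte_size)
shm.set_shared_memory_region(shm_handle, [input_image_buffer])
triton_client.register_system_shared_memory("input_data", "/input_data", byte_size)
inputs[0].set_shared_memory("input_data", byte_size))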


philipp-schmidt commented on June 14, 2024

The perf_client does send an actual image - you can even specify the image if you want. The problem is definitely in the client code. Please try the client.py from this repo and see if there is a difference. I'm unsure about the Triton logs and what the timings mean in more detail.


dominusLabs commented on June 14, 2024

It seems it was an error in the client, though what it was exactly I am unsure of.

However, another problem seems to have come up: the detections from the model don't seem correct. The client looks for an output named "detections", but Triton now only exposes "boxes" and "confs" as outputs. I replaced "detections" with "boxes" in the client.py script and kept everything else the same, but then the detections were all messed up (see image below).

We tested this exact image with our model before Triton, with plain TensorRT, and it detected everything as expected, so it certainly cannot be because of the model.

[Screenshot: incorrect detection results]
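(For reference: with the two-output engine described by the config above, a postprocess step written for a single fused "detections" tensor will misread "boxes" on its own, because the class confidences live in "confs". A hedged sketch of requesting both outputs - names and shapes are taken from the config earlier in this thread; how they are combined depends on the repo's postprocess code:

# Request both outputs of the two-output engine.
outputs = [
    grpcclient.InferRequestedOutput('boxes'),  # shape [1, 10647, 1, 4]
    grpcclient.InferRequestedOutput('confs'),  # shape [1, 10647, 18]
]
results = triton_client.infer(model_name=FLAGS.model, inputs=inputs, outputs=outputs)
boxes = results.as_numpy('boxes')
confs = results.as_numpy('confs'))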


philipp-schmidt commented on June 14, 2024

No, the output of the newest version of our networks is called "detections" and consists of just one array. Please check again.

