Comments (7)
Hi, please verify your GPU is actually running this by looking at nvidia-smi. Also try Triton's own performance analysis tool, which is bundled with the SDK and packaged in the Docker images, and test your endpoint with it to make sure the slowdown isn't caused by your client-side code.
from yolov4-triton-tensorrt.
Hi
I checked nvidia-smi and it does indeed load the model into GPU memory. However, memory usage does not actually increase during inference, which I assume it should. So maybe on an inference request it falls back to the CPU, which would explain why it takes so much longer.
When I query the model configuration, the following output is produced:
{
  "name": "yolov4",
  "platform": "tensorrt_plan",
  "backend": "tensorrt",
  "version_policy": {
    "latest": {
      "num_versions": 1
    }
  },
  "max_batch_size": 0,
  "input": [
    {
      "name": "input",
      "data_type": "TYPE_FP32",
      "format": "FORMAT_NONE",
      "dims": [1, 3, 416, 416],
      "is_shape_tensor": false,
      "allow_ragged_batch": false,
      "optional": false
    }
  ],
  "output": [
    {
      "name": "boxes",
      "data_type": "TYPE_FP32",
      "dims": [1, 10647, 1, 4],
      "label_filename": "",
      "is_shape_tensor": false
    },
    {
      "name": "confs",
      "data_type": "TYPE_FP32",
      "dims": [1, 10647, 18],
      "label_filename": "",
      "is_shape_tensor": false
    }
  ],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
      "enable": true
    },
    "output_pinned_memory": {
      "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "dynamic_batching": {
    "preferred_batch_size": [],
    "max_queue_delay_microseconds": 0,
    "preserve_ordering": false,
    "priority_levels": 0,
    "default_priority_level": 0,
    "priority_queue_policy": {}
  },
  "instance_group": [
    {
      "name": "yolov4",
      "kind": "KIND_GPU",
      "count": 1,
      "gpus": [0],
      "secondary_devices": [],
      "profile": [],
      "passive": false,
      "host_policy": ""
    }
  ],
  "default_model_filename": "model.plan",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
Additionally, according to the logs from my initial post, the following occurs
I0320 05:10:01.545749 1 tensorrt.cc:5376] model yolov4, instance yolov4, executing 1 requests
I0320 05:10:01.545785 1 tensorrt.cc:1609] TRITONBACKEND_ModelExecute: Issuing yolov4 with 1 requests
I0320 05:10:01.545796 1 tensorrt.cc:1668] TRITONBACKEND_ModelExecute: Running yolov4 with 1 requests
I0320 05:10:01.545842 1 tensorrt.cc:2804] Optimization profile default [0] is selected for yolov4
I0320 05:10:01.545896 1 pinned_memory_manager.cc:161] pinned memory allocation: size 2076672, addr 0x7fdff8000090
I0320 05:10:01.546688 1 tensorrt.cc:2181] Context with profile default [0] is being executed for yolov4
I0320 05:10:01.549341 1 infer_response.cc:166] add response output: output: boxes, type: FP32, shape: [1,10647,1,4]
I0320 05:10:01.549391 1 grpc_server.cc:2498] GRPC: unable to provide 'boxes' in GPU, will use CPU
I0320 05:10:01.549439 1 grpc_server.cc:2509] GRPC: using buffer for 'boxes', size: 170352, addr: 0x7fdf5800e960
I0320 05:10:01.549454 1 pinned_memory_manager.cc:161] pinned memory allocation: size 170352, addr 0x7fdff81fb0a0
I0320 05:10:01.558050 1 grpc_server.cc:3572] ModelInferHandler::InferResponseComplete, 6 step ISSUED
I0320 05:10:01.558085 1 grpc_server.cc:2591] GRPC free: size 170352, addr 0x7fdf5800e960
I0320 05:10:01.558338 1 grpc_server.cc:3148] ModelInferHandler::InferRequestComplete
I0320 05:10:01.558373 1 tensorrt.cc:2661] TRITONBACKEND_ModelExecute: model yolov4 released 1 requests
So it seems the actual inference request only takes ~10-12 milliseconds, but the total round-trip time is still 3 seconds.
This is the client script I am using to send the inference requests:
import sys
import time

import cv2
import numpy as np
import tritonclient.grpc as grpcclient

# FLAGS, preprocess() and postprocess() are defined elsewhere in the script.

def main():
    triton_client = grpcclient.InferenceServerClient(
        url=FLAGS.url,
        verbose=FLAGS.verbose)

    inputs = []
    outputs = []
    inputs.append(grpcclient.InferInput('input', [1, 3, FLAGS.width, FLAGS.height], "FP32"))
    outputs.append(grpcclient.InferRequestedOutput('boxes'))

    print("Creating buffer from image file...")
    input_image = cv2.imread(str(FLAGS.input))
    if input_image is None:
        print(f"FAILED: could not load input image {str(FLAGS.input)}")
        sys.exit(1)

    input_image_buffer = preprocess(input_image, [FLAGS.width, FLAGS.height])
    input_image_buffer = np.expand_dims(input_image_buffer, axis=0)
    inputs[0].set_data_from_numpy(input_image_buffer)

    print("Invoking inference...")
    print(int(time.time()))
    results = triton_client.infer(model_name=FLAGS.model,
                                  inputs=inputs,
                                  outputs=outputs)
    result = results.as_numpy('boxes')
    # print(f"Received result buffer of size {result.shape}")
    # print(f"Naive buffer sum: {np.sum(result)}")
    detected_objects = postprocess(result, input_image.shape[1], input_image.shape[0],
                                   [FLAGS.width, FLAGS.height], FLAGS.confidence, FLAGS.nms)
    # print(f"Detected objects: {len(detected_objects)}")
    print(int(time.time()))
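One caveat about the timing in that script: `int(time.time())` truncates to whole seconds, so the printed difference can be off by almost a second in either direction, and it lumps channel setup together with the `infer()` call itself. A minimal sketch of higher-resolution timing (the `time_call` helper and the wrapped call are illustrative, not part of the original script):

```python
import time

def time_call(fn, *args, **kwargs):
    """Time a single call with sub-millisecond resolution.

    int(time.time()) only has whole-second resolution, so a ~3 s gap
    measured that way cannot show where the time actually goes.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Hypothetical usage around the Triton call:
# results, ms = time_call(triton_client.infer, model_name=FLAGS.model,
#                         inputs=inputs, outputs=outputs)
# print(f"infer() took {ms:.1f} ms")
```

Timing the `infer()` call in isolation, separately from client construction and preprocessing, would show whether the 3 seconds is spent in the request or around it.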
I ran the performance analysis tool and these are the results
According to perf_client everything is working as expected. I don't see where you are measuring the 3-second round trip.
Please check the very first post in this issue; I pasted the logs in it.
Below is the output from the client.py script I sent above. As you can see, it prints the current time (epoch seconds) right before it sends the request to the server and again after it has received the result. The difference is 3 seconds.
In the logs I sent in the first post, the very first line prints out
I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START
The very last line prints out
I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6
As you can see, it started at 05:09:58 and finished at 05:10:01 - a 3-second difference.
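That gap can be computed directly from the log timestamps; note that the measured span begins at the ServerLive health check, so it covers more than just the inference itself. A small sketch of parsing the glog-style timestamps Triton emits (`HH:MM:SS.ffffff` in the second field):

```python
from datetime import datetime

def glog_delta_seconds(first_line, last_line):
    """Elapsed seconds between two Triton (glog-style) log lines.

    The timestamp is the second whitespace-separated field,
    formatted HH:MM:SS.ffffff (no date component).
    """
    def ts(line):
        return datetime.strptime(line.split()[1], "%H:%M:%S.%f")
    return (ts(last_line) - ts(first_line)).total_seconds()

first = "I0320 05:09:58.334630 1 grpc_server.cc:272] Process for ServerLive, rpc_ok=1, 3 step START"
last = "I0320 05:10:01.558696 1 grpc_server.cc:2419] Done for ModelInferHandler, 6"
print(glog_delta_seconds(first, last))  # 3.224066
```

The TRITONBACKEND_ModelExecute lines in the same log span only ~12 ms, so most of the 3.2 s elapses between the ServerLive check and the actual inference request.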
I don't believe the performance client gives an accurate measurement, because I don't think it sends an actual image. The command I used to run the perf client is below:
perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 4
The perf_client does send an actual image - you can even specify the image if you want. The problem is definitely in the client code. Please try the client.py from this repo and see if there is a difference. I'm unsure about the Triton logs and what the timings mean in more detail.
It seems it was an error in the client - what it was exactly, I am unsure.
However, another problem has come up - the detections from the model don't seem correct. The client seems to be looking for "detections", but Triton only allows "boxes" and "confs" as outputs now. I replaced "detections" in the client.py script with "boxes" and kept everything else the same, but when I did that, the detections were all messed up (see image below).
We tested this exact image with our model before Triton, using plain TensorRT, and it detected everything as expected, so it certainly cannot be the model.
No, the output of the newest version of our networks is called "detections" and consists of just one array; please check again.