haohaonju / centerpoint

TensorRT deployment for CenterPoint Lidar Detection Model.

License: MIT License

C++ 46.16% CMake 6.23% Python 37.45% C 0.62% Cuda 3.02% Shell 0.15% Common Lisp 4.54% JavaScript 1.83% HTML 0.01%
3d-detection deployment export-onnx object-tracking tensorrt

centerpoint's Introduction

CenterPoint: A Lidar Object Detection & Tracking Project Implemented with TensorRT

This project implements CenterPoint with TensorRT. CenterPoint is a 3D object detection model that uses center points in the bird's-eye view. The code is written according to the project

In addition, it runs inference on WaymoOpenSet.

Setup

The project has been tested on Ubuntu 18.04 and Ubuntu 20.04. It relies mainly on TensorRT and CUDA as third-party packages, with the following versions:

TensorRT: 8.0.1.6

CUDA: 11.3

Baseline ONNX models trained with this config are provided in models. To export your own models, we assume you have the CenterPoint project installed; you can then set up a local det3d environment:

cd /PATH/TO/centerpoint/tools 
bash setup3.sh

Preparation

Export as onnx models

To export your own models, you can run

python3 export_onnx.py \
--config waymo_centerpoint_pp_two_pfn_stride1_3x.py \
--ckpt your_model.pth \
--pfe_save_path pfe.onnx \
--rpn_save_path rpn.onnx

Here we extract two pure neural-network models, pfe and rpn, from the whole computation graph. This makes it easier for TensorRT to optimize its inference engines. CUDA code connects the two engines.
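
To make the hand-off between the two engines more concrete, here is a minimal sketch of a scatter step of the kind that sits between pfe and rpn in a PointPillars-style pipeline. It is plain Python rather than the project's CUDA code, and the function name, grid size and feature width are illustrative assumptions.

```python
def scatter_to_bev(pillar_features, pillar_coords, grid_h, grid_w):
    """Place per-pillar feature vectors onto a dense bird's-eye-view canvas.

    pillar_features: list of feature vectors, one per non-empty pillar
    pillar_coords:   list of (row, col) grid cells, same length
    Returns canvas[channel][row][col]; cells without a pillar stay 0.0.
    """
    if len(pillar_features) != len(pillar_coords):
        raise ValueError("features and coords must have the same length")
    channels = len(pillar_features[0]) if pillar_features else 0
    canvas = [[[0.0] * grid_w for _ in range(grid_h)] for _ in range(channels)]
    for feat, (row, col) in zip(pillar_features, pillar_coords):
        for c, value in enumerate(feat):
            canvas[c][row][col] = value
    return canvas


# Two pillars with 3-channel features on a 4x4 grid (toy sizes, hypothetical).
bev = scatter_to_bev([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [(0, 1), (3, 2)], 4, 4)
```

The first network produces one feature vector per pillar, and the second consumes the dense canvas, which is why only this indexing step needs custom code between them.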

Generate TensorRT serialized engines

You can create TensorRT engines directly from the ONNX models and skip this step. However, a better approach is to build the engines once and load the saved serialized engine files afterwards.

To generate them, run

python3 create_engine.py \
--config waymo_centerpoint_pp_two_pfn_stride1_3x.py \
--pfe_onnx_path pfe.onnx \
--rpn_onnx_path rpn.onnx \
--pfe_engine_path pfe_fp.engine \
--rpn_engine_path rpn_fp.engine

By default, this generates fp16 engine files.

Work with int8

According to Nvidia, there are two ways to quantize a model: explicit and implicit quantization.

For explicit quantization, go to TensorRT/bin and run ./trtexec --onnx=model.onnx --int8 --saveEngine=model.engine. You will need to compile TensorRT from source.

For implicit quantization, you first need to generate calibration files. We assume you have downloaded waymo_openset and converted it into the required data format according to this

python3 generate_calib_data.py \
--config waymo_centerpoint_pp_two_pfn_stride1_3x.py \
--ckpt your_model.pth \
--calib_file_path your_calib_files

Then build the quantized engines with the provided script:

python3 create_engine.py \
--config waymo_centerpoint_pp_two_pfn_stride1_3x.py \
--pfe_onnx_path pfe.onnx \
--rpn_onnx_path rpn.onnx \
--pfe_engine_path pfe_quant.engine \
--rpn_engine_path rpn_quant.engine \
--quant \
--calib_file_path your_calib_files \
--calib_batch_size 10
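
As a rough illustration of what calibration produces, the sketch below derives a symmetric per-tensor int8 scale from the largest absolute value in a toy calibration set (the min-max idea) and checks the round-trip error. It is plain Python with made-up numbers, not TensorRT's calibrator, and the helper names are hypothetical.

```python
def minmax_scale(samples):
    """Symmetric per-tensor scale from the largest absolute calibration value."""
    amax = max(abs(v) for v in samples)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(value, scale):
    """Round to the nearest int8 step and clamp to [-127, 127]."""
    q = round(value / scale)
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map an int8 code back to a float."""
    return q * scale


calib = [0.02, -0.5, 1.27, 0.9, -1.0]  # toy activations, not real data
scale = minmax_scale(calib)
restored = [dequantize(quantize(v, scale), scale) for v in calib]
```

Values inside the calibrated range come back within half a quantization step. Values outside it are clamped, which is one reason too few or unrepresentative calibration samples can hurt accuracy.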

Run inference

After preparation, build the TensorRT project with the following commands:

cd /PATH/TO/centerpoint
mkdir build && cd build
cmake .. && make

To create the engines from ONNX files at startup, run inference with

./build/centerpoint \
--pfeOnnxPath=models/pfe_baseline32000.onnx \
--rpnOnnxPath=models/rpn_baseline.onnx \
--savePath=results \
--filePath=/PATH/TO/DATA \
--fp16

Or load engine files directly

./build/centerpoint \
--pfeEnginePath=pfe_fp.engine \
--rpnEnginePath=rpn_fp.engine \
--savePath=results \
--filePath=/PATH/TO/DATA \
--loadEngine

Here filePath refers to the input bin files generated by tools/generate_input_data.py.
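
For a quick sanity check of such a file, the sketch below reads a flat binary file of float32 values and groups them into points. The layout is an assumption: 5 native-endian float32 values per point (a point dimension of 5 is mentioned in the issues further down), and the script only round-trips synthetic data through a temporary file.

```python
import os
import tempfile
from array import array

POINT_DIM = 5  # assumption: x, y, z plus two extra per-point features


def read_points(path, point_dim=POINT_DIM):
    """Read a flat float32 file and group the values into points."""
    values = array("f")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    if len(values) % point_dim != 0:
        raise ValueError("file size is not a multiple of point_dim floats")
    return [tuple(values[i:i + point_dim]) for i in range(0, len(values), point_dim)]


# Round-trip two synthetic points through a temporary file.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(array("f", [1.0, 2.0, 3.0, 0.5, 0.0, 4.0, 5.0, 6.0, 0.25, 0.0]).tobytes())
points = read_points(tmp.name)
os.remove(tmp.name)
```

If the value count is not a multiple of the assumed point dimension, the file was probably written with a different layout.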

You can also download seq1 directly to quickly test your model. It is available on Baidu Cloud Disk:

link:https://pan.baidu.com/s/1Ua9F3eFflA9Gckpa9U-1Eg  passwd:08s6

Computation Speed

Acceleration is the main goal of this project, so most of the computation (including preprocessing and postprocessing) runs on the GPU. The table below gives the average time in milliseconds of each computation module, measured on an RTX 3080 over all 39987 Waymo validation samples.

| Configuration (times in ms) | Preprocess | PfeInfer | VoxelAssign | RpnInfer | Postprocess |
|---|---|---|---|---|---|
| fp32+gpupre+gpupost | 1.73 | 8.47 | 0.36 | 25.0 | 2.01 |
| fp16+gpupre+gpupost | 1.61 | 5.88 | 0.17 | 6.89 | 2.37 |
| fp16+cpupre+gpupost | 9.2 | 6.14 | 0.42 | 7.14 | 2.10 |
| int8(minmax)+gpupre+gpupost | 1.61 | 8.23 | 0.17 | 5.25 | 3.21 |
| int8(entropy)+gpupre+gpupost | 1.41 | 7.45 | 0.17 | 4.65 | 2.11 |
| int8(explicit)+gpupre+gpupost | 2.2 | 8.0 | 0.17 | 8.18 | 2.59 |

Note that fp16 or int8 engines may still contain fp32 layers: we have no control over which tensors are int8, fp16 or fp32, because TensorRT decides this. We can see that fp16 mode runs much faster than fp32 mode, and that GPU preprocessing runs much faster than CPU preprocessing, because the CUDA implementation processes points in parallel, one thread per point, while the CPU implementation processes them in a for loop.
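
To compare configurations end to end, the per-stage figures can be summed. The sketch below does this for the numbers in the table above, assuming the five stages run strictly one after another with no overlap, which the README does not state. The frame rate is therefore only an estimate.

```python
# Per-stage averages in milliseconds, copied from the table above.
STAGES = ("Preprocess", "PfeInfer", "VoxelAssign", "RpnInfer", "Postprocess")
TIMINGS_MS = {
    "fp32+gpupre+gpupost": (1.73, 8.47, 0.36, 25.0, 2.01),
    "fp16+gpupre+gpupost": (1.61, 5.88, 0.17, 6.89, 2.37),
    "fp16+cpupre+gpupost": (9.2, 6.14, 0.42, 7.14, 2.10),
    "int8(minmax)+gpupre+gpupost": (1.61, 8.23, 0.17, 5.25, 3.21),
    "int8(entropy)+gpupre+gpupost": (1.41, 7.45, 0.17, 4.65, 2.11),
    "int8(explicit)+gpupre+gpupost": (2.2, 8.0, 0.17, 8.18, 2.59),
}


def summarize(timings):
    """Return {config: (total_ms, estimated_frames_per_second)}."""
    summary = {}
    for name, stage_ms in timings.items():
        total = sum(stage_ms)
        summary[name] = (total, 1000.0 / total)
    return summary


for name, (total, fps) in summarize(TIMINGS_MS).items():
    print(f"{name:32s} {total:6.2f} ms  {fps:5.1f} frames/s")
```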

Metrics

To compute evaluation metrics, run cd tools && python3 waymo_eval.py --cpp_output --save_path ../results. We set the score threshold to 0.2 and the 2D IoU thresholds to [0.7, 0.5] for vehicle and pedestrian, as in waymo openset. The evaluation results are shown below:

| Configuration | Vehicle_level2/mAP | Vehicle_level2/mAPH | Vehicle_level2 [email protected] | Pedestrian_level2/mAP | Pedestrian_level2/mAPH | Pedestrian_level2 [email protected] |
|---|---|---|---|---|---|---|
| fp32+cpupre+cpupost | 0.7814 | 0.7240 | 0.3966 | 0.6837 | 0.5668 | 0.3739 |
| fp32+gpupre+gpupost | 0.8039 | 0.7947 | 0.5731 | 0.6723 | 0.5588 | 0.2310 |
| fp16+gpupre+gpupost | 0.8038 | 0.7945 | 0.5730 | 0.6671 | 0.5541 | 0.2301 |
| int8+gpupre+gpupost | 0.5827 | 0.5615 | 0.4061 | 0.0634 | 0.0456 | 0.0 |

Below are the older metrics, computed with 3D IoU thresholds of [0.5, 0.5] for vehicle and pedestrian. The conclusions are the same.

| Configuration | Vehicle_level2/mAP | Vehicle_level2/mAPH | Vehicle_level2 [email protected] | Pedestrian_level2/mAP | Pedestrian_level2/mAPH | Pedestrian_level2 [email protected] |
|---|---|---|---|---|---|---|
| torchModel | 0.6019 | 0.5027 | 0.0241 | 0.5545 | 0.5377 | 0.0547 |
| fp32+cpupre+gpupost | 0.6019 | 0.5027 | 0.0241 | 0.5546 | 0.5378 | 0.0548 |
| fp16+cpupre+gpupost | 0.6024 | 0.5030 | 0.0240 | 0.5545 | 0.5378 | 0.0542 |
| fp16+gpupre+gpupost | 0.6207 | 0.5173 | 0.2327 | 0.5788 | 0.5624 | 0.2984 |
| int8(minmax)+gpupre+gpupost | 0.3470 | 0.2889 | 0.0 | 0.3222 | 0.3065 | 0.0 |
| int8(entropy)+gpupre+gpupost | 0.1396 | 0.1049 | 0.0014 | 0.1550 | 0.1452 | 0.0008 |
| int8(explicit)+gpupre+gpupost | 0.4642 | 0.3823 | 0.0288 | 0.4248 | 0.4112 | 0.0201 |

From the above metrics, we can see that:

  1. The fp16 model performs almost as well as the fp32 model while running much faster.
  2. The float model with cpupre+gpupost achieves the same result as the original torch model, because they preprocess points in the same way, while GPU preprocessing performs better. Because points processed by multiple threads are unordered, a voxel that contains more points than the given threshold is subsampled without bias. For the detailed reason, see here
  3. int8 mode cannot match the results of float mode, and implicit calibration performs worse than explicit calibration. We used 1000 samples for calibration with a batch size of 10, so the samples may be insufficient. However, the explicitly calibrated model runs slower than the implicit ones.
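
The subsampling argument in point 2 can be illustrated with a toy comparison. The sketch below contrasts keeping the first N points of a voxel in scan order with keeping a uniformly random subset. It is a plain-Python illustration with synthetic points, not the project's preprocessing code, and treating the GPU's unordered writes as uniform sampling is an idealization.

```python
import random


def keep_first(points, max_points):
    """Sequential truncation: points late in the scan order never survive."""
    return points[:max_points]


def keep_random(points, max_points, rng):
    """Uniform subsampling: every point has the same chance of being kept."""
    if len(points) <= max_points:
        return list(points)
    return rng.sample(points, max_points)


def mean_z(points):
    """Average height of a set of (x, y, z) points."""
    return sum(p[2] for p in points) / len(points)


# 100 synthetic points in one voxel, ordered by height z = 0.00 .. 0.99.
voxel = [(0.0, 0.0, i / 100.0) for i in range(100)]
first = keep_first(voxel, 20)
rand = keep_random(voxel, 20, random.Random(0))
print(f"all: {mean_z(voxel):.3f}  first-N: {mean_z(first):.3f}  random: {mean_z(rand):.3f}")
```

When the input order correlates with position, truncation keeps only one end of the voxel, while a random subset stays closer to the statistics of the full set.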

Online Tracking and Visualization

We use rviz to visualize perception results. To set up the catkin workspace, go to tools/catkin_ws and run

bash catkin_make.sh
source devel/setup.bash

You need to configure your file paths in tools/catkin_ws/src/waymo_track/src/waymo_track.py. Then open two more terminals: run roscore in one and rviz in the other to show results (you can directly load default.rviz in catkin_ws). In your original terminal, go to catkin_ws and run rosrun waymo_track waymo_track.py. You should then see the detection or tracking results in the rviz window.

Detection and tracking results are shown below:

float detection & tracking

implicit quantization & explicit quantization (tracking)

What has been done?

To learn more about the detailed computation graph of CenterPoint, please refer to the following picture.

Acknowledgements

This project borrows some code from:

CenterPoint

TensorRT

CenterPoint-PointPillars

SORT

Contact

Hao Wang by [email protected]

centerpoint's Issues

w, l, h order is incorrect

Shouldn't the ordering of width, length and height be changed in the following:

box.l = host_boxes[i + 3 * boxSizeAft];

to:

box.w = host_boxes[i + 3 * boxSizeAft];    // dx
box.l = host_boxes[i + 4 * boxSizeAft];    // dy
box.h = host_boxes[i + 5 * boxSizeAft];    // dz
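
For readers following this report, the indexing pattern in the snippet can be sketched in a few lines. The example assumes a channel-major buffer in which channel k of box i sits at i + k * boxSizeAft, with channels 0 to 2 holding x, y, z and channels 3 to 5 holding the three sizes. Which size name belongs to which channel is exactly what the issue questions, so the order is left as a parameter. The function and argument names are hypothetical.

```python
def decode_box(host_boxes, i, box_size_aft, size_names=("w", "l", "h")):
    """Read one box from a channel-major flat buffer.

    Channel k of box i is stored at host_boxes[i + k * box_size_aft].
    size_names gives the (assumed) meaning of channels 3, 4 and 5.
    """
    box = {}
    for channel, name in enumerate(("x", "y", "z") + tuple(size_names)):
        box[name] = host_boxes[i + channel * box_size_aft]
    return box


# Two boxes, six channels each, stored channel by channel (toy values).
flat = [1, 10, 2, 20, 3, 30, 4, 40, 5, 50, 6, 60]
second = decode_box(flat, i=1, box_size_aft=2)
swapped = decode_box(flat, i=1, box_size_aft=2, size_names=("l", "w", "h"))
```

Changing size_names is enough to test either convention against a visualization.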

inference time about int8 pfe

Thank you for your excellent work, @Abraham423

In the Computation Speed section of README.md, I noticed that int8 mode doesn't run faster than fp32/fp16 mode for the pfe module.

Do you know the reason?

How do we toggle GPU preprocessing?

[09/17/2023-00:43:46] [I] Average PreProcess Time: 15.7913 ms
[09/17/2023-00:43:46] [I] Average PfeInfer Time: 9.98707 ms
[09/17/2023-00:43:46] [I] Average ScatterInfer Time: 0.410112 ms
[09/17/2023-00:43:46] [I] Average RpnInfer Time: 18.2958 ms
[09/17/2023-00:43:46] [I] Average PostProcess Time: 3.27232 ms

My timings are shown above. Based on them, it seems the CPU is handling the preprocessing. How do I ensure the GPU handles the preprocessing?
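
To compare a log like this with the Computation Speed table in the README, the per-stage averages can be pulled out with a small script. The sketch below assumes the line format shown in this report ("Average <Stage> Time: <value> ms") and uses the reported lines as input.

```python
import re

LOG = """\
[09/17/2023-00:43:46] [I] Average PreProcess Time: 15.7913 ms
[09/17/2023-00:43:46] [I] Average PfeInfer Time: 9.98707 ms
[09/17/2023-00:43:46] [I] Average ScatterInfer Time: 0.410112 ms
[09/17/2023-00:43:46] [I] Average RpnInfer Time: 18.2958 ms
[09/17/2023-00:43:46] [I] Average PostProcess Time: 3.27232 ms
"""

PATTERN = re.compile(r"Average (\w+) Time: ([\d.]+) ms")


def parse_timings(text):
    """Return {stage_name: average_ms} for every matching log line."""
    return {name: float(ms) for name, ms in PATTERN.findall(text)}


timings = parse_timings(LOG)
slowest = max(timings, key=timings.get)
print(f"slowest stage: {slowest} ({timings[slowest]:.2f} ms)")
```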

Why two subgraphs instead of one whole graph

In the README, it says "Here we extract two pure nn models from the whole computation graph---pfe and rpn, this is to make it easier for trt to optimize its inference engines, and we use cuda to connect these nn engines."

Is there any repo, documentation, link or tutorial that supports this argument? That is, why is it easier for TensorRT to optimize its inference engines with two ONNX files rather than one?

Is this an error in bin file creation?

First of all, thank you for your project.

I converted waymo dataset to pkl file according to this guide.
And I created a bin file with your "generate_input_data.py".

When I used these bin files, the detection result did not reach your sample result.
Again, the detection results of your samples (seq_0_frame_101.bin and seq_0_frame_100.bin) were very good.

Is this an error in bin file creation?
If my PC environment was a problem, the detection results of seq_0_frame_101.bin and seq_0_frame_100.bin would not be as good as the sample.

The environment in which I created the bin file is as follows.
And I attach the bin file and the result that I used.

  • dataset: waymo open dataset v1.2.0
  • Waymo-open-dataset devkit: waymo-open-dataset-tf-2.11.0==1.5.0
  • Command
    ./build/centerpoint --pfeOnnxPath=models/pfe_baseline32000.onnx --rpnOnnxPath=models/rpn_baseline.onnx --savePath=results --filePath=lidars --fp16

Screenshot from 2023-05-03 17-31-01

seq_0_frame_197.bin.txt
seq_0_frame_197.zip

Compared with pointpillar

Hi, I'm a new user of TensorRT. I wonder how much mAP centerpoint-pointpillar gains and how much FPS it loses compared to pointpillar.

How to evaluate the mAP

Hello,
Firstly, thanks for your wonderful work, I managed to reproduce the test results.
image
I have a small question: in your readme, there are references to mAP data, but I can't find anything about it in this project. How can I figure out the mAP?
Thank you!

ONNX model incorrect!

This is a visualization of your pfe ONNX model:
pfe_baseline32000 onnx
This is a visualization of my pfe ONNX model:
image
I think my model is not correct. What did I do that produced an incorrect model?

ONNX creation

How did you obtain those onnx files? Could you please provide guidance on that?

export_onnx set to true

Why do you skip most of the forward function when export_onnx is set to true? Won't it affect the result?

Segmentation fault (core dumped)

(centerpoint) yixin@yixin:~/Desktop/CenterPoint$ python3 tools/export_onnx.py \
--config tools/waymo_centerpoint_pp_two_pfn_stride1_3x.py \
--ckpt tools/centerpoint_pp_36.pth \
--pfe_save_path /home/yixin/Desktop/CenterPoint/models/pfezyx.onnx \
--rpn_save_path /home/yixin/Desktop/CenterPoint/models/rpnzyx.onnx
Import spconv fail, no support for sparse convolution!
iou3d cuda not built. You don't need this if you use circle_nms. Otherwise, refer to the advanced installation part to build this cuda extension
no apex
2022-06-18 23:18:09.004104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Segmentation fault (core dumped)

Environment: RTX 3060, TensorRT 8.0.1, CUDA 11.0

Original checkpoint.pth file

Hello,
I have successfully run your CenterPoint TensorRT project with great results and would like to thank you from the bottom of my heart for sharing your excellent work!

As the waymo dataset is too large and I don't have a GPU device at hand that can support the training samples, could you please share your checkpoint.pth file in your busy schedule, I would like to learn to run through the steps of exporting to onnx and test the effect of porting your project to run on our lab AI development board.

Thanks!

Why is my detection output so weak, unlike what fp_det.gif shows?

I converted the default ONNX model to TensorRT and used the newest Waymo dataset,
which is "individual_files_validation_segment-10203656353524179475_7625_000_7645_000_with_camera_labels.tfrecord".
My result is shown below:
image
There are few or no detection boxes.
Your result shows good detections, as below:
image
Could you please tell me what causes this difference?

What is the input for `tools/generate_input_data.py`

As mentioned in the title, when I run inference I need to give the parameter --filePath=/PATH/TO/DATA. However, the README says filePath refers to input bin files generated by tools/generate_input_data.py. So what is the input data for tools/generate_input_data.py?

Looking forward to your reply!
Thx!

About velocity

Hi, I wonder how you obtain velocity in your implementation. I read the paper and it states that results from two time-steps are compared to calculate the velocity but I don't see it in your code. Thanks.

Paper statement: velocity estimate is special, as it requires two input map-views the current and previous time-step. It predicts
the difference in object position between the current and the past frame.
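
As a reading aid for the paper statement quoted above, the relation between a predicted frame-to-frame displacement and a velocity can be written out directly. The sketch below is plain arithmetic, not code from this repository. The 0.1 s frame interval and the helper names are assumptions for illustration.

```python
def velocity_from_offset(offset_xy, dt):
    """Convert a frame-to-frame displacement in metres into metres per second."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return (offset_xy[0] / dt, offset_xy[1] / dt)


def project_to_previous(center_xy, velocity_xy, dt):
    """Estimate where an object was one frame earlier, given its velocity."""
    return (center_xy[0] - velocity_xy[0] * dt,
            center_xy[1] - velocity_xy[1] * dt)


DT = 0.1  # assumed time between two lidar frames, in seconds
vel = velocity_from_offset((0.5, -0.1), DT)
prev = project_to_previous((12.0, 3.0), vel, DT)
```

Projecting current centers back by one frame in this way is a common basis for associating detections across frames in a tracker.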

3d tracker sort

Will you provide a C++ implementation of the 3D SORT tracker? Thank you.

About box visualization

I want to test your model on the sample bin data you provided in the repo. I am trying to visualize the predicted boxes with the open3d library, but the boxes seem incorrect: the person box (blue) lies horizontally like a car, while the car boxes (red) are displayed correctly. I used the following code to visualize the results:

import numpy as np
import open3d as o3d

annotation_file = open('result/seq_0_frame_100.bin.txt', 'r')
boxes = []
lines = annotation_file.readlines()
for line in lines:
    tokens = line.split()
    x = float(tokens[0])
    y = float(tokens[1])
    z = float(tokens[2])
    h = float(tokens[3])
    w = float(tokens[4])
    l = float(tokens[5])
    velX = float(tokens[6])
    velY = float(tokens[7])
    theta = float(tokens[8])
    score = float(tokens[9])
    cls = int(tokens[10])
    if score > 0.2:
        boxes.append([x,y,z,h,w,l,theta,cls])
        
        #box = [h,w,l,x,y,z,rot]
def roty(t):
    """
    Rotation about the y-axis.
    """
    c = np.cos(t)
    s = np.sin(t)
    return np.array([[c, 0, s],
                     [0, 1, 0],
                     [-s, 0, c]])

def box_center_to_corner(box):
    translation = box[0:3]
    h, w, l = box[3], box[4], box[5]
    #if the angle value is in radian then use below mentioned conversion
    #rotation_y = box[6]
    #rotation = rotation_y * (180/math.pi)                             #rad to degree
    rotation = box[6]

    # Create a bounding box outline if x,y,z is center point then use defination bounding_box as mentioned below
    bounding_box = np.array([
        [-l/2, -l/2, l/2, l/2, -l/2, -l/2, l/2, l/2],
        [w/2, -w/2, -w/2, w/2, w/2, -w/2, -w/2, w/2],
        [-h/2, -h/2, -h/2, -h/2, h/2, h/2, h/2, h/2]])
                      

    # Standard 3x3 rotation matrix around the Z axis
    rotation_matrix = np.array([
        [np.cos(rotation), -np.sin(rotation), 0.0],
        [np.sin(rotation), np.cos(rotation), 0.0],
        [0.0, 0.0, 1.0]])

    # Repeat the [x, y, z] eight times
    eight_points = np.tile(translation, (8, 1))

    # Translate the rotated bounding box by the
    # original center position to obtain the final box
    corner_box = np.dot(rotation_matrix, bounding_box) + eight_points.transpose()

    return corner_box.transpose()

# Must be at module level: indented after the return it would never run.
entities_to_draw = []
    
for box in boxes:
    boxes3d_pts = box_center_to_corner(box)
    boxes3d_pts = boxes3d_pts.T
    boxes3d_pts = o3d.utility.Vector3dVector(boxes3d_pts.T)
    box3d = o3d.geometry.OrientedBoundingBox.create_from_points(boxes3d_pts)
    if box[-1] == 0:
        box3d.color = [1, 0, 0]           #Box color would be red box.color = [R,G,B]
    elif box[-1] == 1:
        box3d.color = [0, 0, 1]
    else:
        box3d.color = [0, 1, 0]
    entities_to_draw.append(box3d)

I would appreciate any feedback on this issue. Thank you.
image

PostProcess time is too long

Hello, I tested your postprocessGPU function and it takes 60+ ms on an Nvidia A6000. I then timed the CUDA functions "sort_by_key" and "_raw_nms_gpu": "sort_by_key" costs 7.5 ms on average for every taskidx, and "_raw_nms_gpu" costs 1~7 ms. Why does it take so long?
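
For context on the two steps named above, the sketch below shows the general sort-then-suppress pattern in plain Python. It uses a simplified center-distance test (in the spirit of the circle_nms mentioned in the CenterPoint log output further up) instead of the rotated-box IoU that a kernel such as _raw_nms_gpu would compute. The function name, threshold and detections are illustrative assumptions, and this is not the repository's implementation.

```python
def circle_nms(dets, min_dist, max_keep=None):
    """Greedy suppression by center distance.

    dets: list of (x, y, score). Returns the indices of kept detections,
    highest score first. A detection is dropped when its center lies
    within min_dist of an already kept one.
    """
    order = sorted(range(len(dets)), key=lambda i: dets[i][2], reverse=True)
    kept = []
    for i in order:
        xi, yi, _ = dets[i]
        if all((xi - dets[j][0]) ** 2 + (yi - dets[j][1]) ** 2 > min_dist ** 2
               for j in kept):
            kept.append(i)
            if max_keep is not None and len(kept) >= max_keep:
                break
    return kept


# Two clusters of two detections each (toy values).
dets = [(0.0, 0.0, 0.9), (0.3, 0.0, 0.8), (5.0, 5.0, 0.7), (5.1, 5.0, 0.95)]
kept = circle_nms(dets, min_dist=1.0)
```

The cost of the suppression loop grows with the number of candidates that survive the score threshold, so timing it for different candidate counts is a reasonable first check when the stage is slow.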

dim error when infer by Tensorrt

[01/12/2023-10:56:18] [I] [TRT] Successfully created plugin: ScatterND
[01/12/2023-10:56:18] [E] [TRT] MatMul_177: last dimension of input0 = 15 and second to last dimension of input1 = 10 but must match.
[01/12/2023-10:56:18] [E] [TRT] ModelImporter.cpp:720: While parsing node number 177 [MatMul -> "195"]:
[01/12/2023-10:56:18] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[01/12/2023-10:56:18] [E] [TRT] ModelImporter.cpp:722: input: "193"
input: "229"
output: "195"
name: "MatMul_177"
op_type: "MatMul"

[01/12/2023-10:56:18] [E] [TRT] ModelImporter.cpp:723: --- End node ---
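
The rule behind this error message can be checked outside TensorRT. The sketch below validates the inner dimensions of a matrix product and returns the result shape, assuming the second operand's batch dimensions broadcast against the first (as in the [32000,20,10] x [1,10,32] case logged in another report below). The failing shape (32000, 20, 15) is hypothetical: the error only states that the inner sizes are 15 and 10.

```python
def check_matmul(a_shape, b_shape):
    """Return the output shape of a @ b, or raise if inner dims differ.

    Assumes both shapes have rank >= 2 and that b's leading (batch)
    dimensions broadcast against a's.
    """
    inner_a, inner_b = a_shape[-1], b_shape[-2]
    if inner_a != inner_b:
        raise ValueError(
            f"last dimension of input0 = {inner_a} and second to last "
            f"dimension of input1 = {inner_b} but must match"
        )
    return tuple(a_shape[:-1]) + (b_shape[-1],)


ok_shape = check_matmul((32000, 20, 10), (1, 10, 32))
try:
    check_matmul((32000, 20, 15), (1, 10, 32))  # hypothetical failing input
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)
```

A mismatch of this kind usually means the exported graph and the tensors fed into it disagree on the size of the last axis.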

Question About PointDim Waymo

Hi, I'm new to point cloud processing. Why is the point dim here 5? Is it x, y, z, i, with elongation as the 5th?
And do you have any idea what to put in that feature if the lidar doesn't provide elongation?

Thanks

cudaErrorInvalidDeviceFunction

Hello, in samplecenterpoint.cpp, I have set params.load_engine = false,
and provided the onnx file path by setting params.pfeOnnxFilePath and params.rpnOnnxFilePath,
compiling succeeded, but when I ran command "./centerpoint", something went wrong.
Please see below:

&&&& RUNNING TensorRT.sample_onnx_centerpoint [TensorRT v8001] # ./centerpoint
[04/11/2022-09:12:41] [I] Building and running a GPU inference engine for CenterPoint
[04/11/2022-09:12:41] [I] Building pfe engine . . .  
[04/11/2022-09:12:42] [I] [TRT] [MemUsageChange] Init CUDA: CPU +149, GPU +0, now: CPU 162, GPU 585 (MiB)
[04/11/2022-09:12:42] [I] ConstructNetwork !
[04/11/2022-09:12:42] [I] [TRT] ----------------------------------------------------------------
[04/11/2022-09:12:42] [I] [TRT] Input filename:   /home/wang/CenterPointTensorRT/pfe_baseline32000.onnx
[04/11/2022-09:12:42] [I] [TRT] ONNX IR version:  0.0.6
[04/11/2022-09:12:42] [I] [TRT] Opset version:    11
[04/11/2022-09:12:42] [I] [TRT] Producer name:    pytorch
[04/11/2022-09:12:42] [I] [TRT] Producer version: 1.9
[04/11/2022-09:12:42] [I] [TRT] Domain:           
[04/11/2022-09:12:42] [I] [TRT] Model version:    0
[04/11/2022-09:12:42] [I] [TRT] Doc string:       
[04/11/2022-09:12:42] [I] [TRT] ----------------------------------------------------------------
[04/11/2022-09:12:42] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/11/2022-09:12:42] [I] [TRT] MatMul_0: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,10][NONE] dims(input1)=[1,10,32][NONE].
[04/11/2022-09:12:42] [I] [TRT] MatMul_0: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,10][NONE] dims(input1)=[1,10,32][NONE].
[04/11/2022-09:12:42] [I] [TRT] MatMul_0: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,10][NONE] dims(input1)=[1,10,32][NONE].
[04/11/2022-09:12:43] [I] [TRT] MatMul_0: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,10][NONE] dims(input1)=[1,10,32][NONE].
[04/11/2022-09:12:43] [I] [TRT] MatMul_18: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,64][NONE] dims(input1)=[1,64,64][NONE].
[04/11/2022-09:12:43] [I] [TRT] MatMul_0: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,10][NONE] dims(input1)=[1,10,32][NONE].
[04/11/2022-09:12:43] [I] [TRT] MatMul_18: broadcasting input1 to make tensors conform, dims(input0)=[32000,20,64][NONE] dims(input1)=[1,64,64][NONE].
[04/11/2022-09:12:43] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 162 MiB, GPU 585 MiB
[04/11/2022-09:12:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +215, GPU +85, now: CPU 377, GPU 670 (MiB)
[04/11/2022-09:12:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +175, GPU +90, now: CPU 552, GPU 760 (MiB)
[04/11/2022-09:12:45] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[04/11/2022-09:12:55] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[04/11/2022-09:12:55] [I] [TRT] Total Host Persistent Memory: 928
[04/11/2022-09:12:55] [I] [TRT] Total Device Persistent Memory: 0
[04/11/2022-09:12:55] [I] [TRT] Total Scratch Memory: 0
[04/11/2022-09:12:55] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 256 MiB
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 555, GPU 768 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 555, GPU 778 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 555, GPU 762 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 555, GPU 746 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 555 MiB, GPU 746 MiB
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 555, GPU 742 (MiB)
[04/11/2022-09:12:55] [I] Create ICudaEngine  !
[04/11/2022-09:12:55] [I] [TRT] Loaded engine size: 0 MB
[04/11/2022-09:12:55] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 555 MiB, GPU 742 MiB
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 555, GPU 750 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 555, GPU 758 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 555, GPU 742 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 555 MiB, GPU 742 MiB
[04/11/2022-09:12:55] [I] getNbInputs: 1 

[04/11/2022-09:12:55] [I] getNbOutputs: 1 

[04/11/2022-09:12:55] [I] getNbOutputs Name: 47 

[04/11/2022-09:12:55] [I] Building rpn engine . . .  
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 555, GPU 742 (MiB)
[04/11/2022-09:12:55] [I] ConstructNetwork !
[04/11/2022-09:12:55] [I] [TRT] ----------------------------------------------------------------
[04/11/2022-09:12:55] [I] [TRT] Input filename:   /home/wang/CenterPointTensorRT/rpn_baseline.onnx
[04/11/2022-09:12:55] [I] [TRT] ONNX IR version:  0.0.6
[04/11/2022-09:12:55] [I] [TRT] Opset version:    10
[04/11/2022-09:12:55] [I] [TRT] Producer name:    pytorch
[04/11/2022-09:12:55] [I] [TRT] Producer version: 1.9
[04/11/2022-09:12:55] [I] [TRT] Domain:           
[04/11/2022-09:12:55] [I] [TRT] Model version:    0
[04/11/2022-09:12:55] [I] [TRT] Doc string:       
[04/11/2022-09:12:55] [I] [TRT] ----------------------------------------------------------------
[04/11/2022-09:12:55] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[04/11/2022-09:12:55] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 572 MiB, GPU 742 MiB
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 575, GPU 750 (MiB)
[04/11/2022-09:12:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 575, GPU 758 (MiB)
[04/11/2022-09:12:55] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[04/11/2022-09:13:11] [F] [TRT] [virtualMemoryBuffer.cpp::resizePhysical::79] Error Code 2: OutOfMemory (no further information)
[04/11/2022-09:13:11] [F] [TRT] [virtualMemoryBuffer.cpp::resizePhysical::65] Error Code 2: OutOfMemory (no further information)
[04/11/2022-09:13:11] [W] [TRT] -------------- The current system memory allocations dump as below --------------
[0x55b6e4bd5350]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2151 time: 1.127e-06
[0x55b6e4bd20d0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2148 time: 9.33e-07
[0x55b6e4bd1d30]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2145 time: 8e-07
[0x55b6e4bd1b60]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2142 time: 1.128e-06
[0x55b6dab9b5b0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2139 time: 2.12e-07
[0x55b6dab9b2a0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2136 time: 1.91e-07
[0x55b6db2f3740]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 27 time: 7.9e-08
[0x55b6db03be30]:1280 :Conv Aspect merge bias in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 33 time: 8.2e-07
[0x55b6db041330]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1067 time: 5.7e-07
[0x55b6db01ed60]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 24 time: 7.5e-08
[0x55b6db4a7750]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 13 time: 9.5e-08
[0x55b6db0eb700]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 21 time: 7.8e-08
[0x55b6db071d40]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 22 time: 7.3e-08
[0x55b6c16af680]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 15 time: 7.5e-08
[0x55b6db075390]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 12 time: 9.7e-08
[0x55b6e4bcd490]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 2133 time: 1.45e-07
[0x55b6dab0a390]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 35 time: 6.95e-07
[0x55b6c15cfe90]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 17 time: 7.7e-08
[0x55b6db0af810]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 25 time: 8.1e-08
[0x55b6db2f5830]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 11 time: 8.1e-08
[0x55b6dab8b1c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 50 time: 1.311e-06
[0x55b6dab99760]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 422 time: 2.69e-07
[0x55b6db0aa730]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 26 time: 7.9e-08
[0x55b6e4bca430]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1750 time: 1.45e-07
[0x55b6db2da8c0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 28 time: 7.4e-08
[0x55b6dab9b970]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 428 time: 8.35e-07
[0x55b6e3d0ed20]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 749 time: 2.53e-07
[0x55b6db4dad80]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 5 time: 7.5e-08
[0x55b6d9e8b930]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 7 time: 8.5e-08
[0x55b6db06e7d0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 9 time: 8.3e-08
[0x55b6e4bca5c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1444 time: 1.75e-07
[0x55b6db1218c0]:262144 :Layer Aspects merge kernel weights in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 30 time: 3.11e-06
[0x55b6db03d3a0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 3 time: 3.2e-08
[0x55b6da957780]:737280 :Conv Aspect merge weights in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 32 time: 5.72e-07
[0x55b6da70fff0]:2097152 :Layer Aspects merge kernel weights in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 31 time: 5.51e-07
[0x55b6db066b60]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 10 time: 7e-08
[0x55b6dab9bbe0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 431 time: 7.78e-07
[0x55b6db2d9290]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 14 time: 8.3e-08
[0x55b6e3d0efa0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 752 time: 1.51e-07
[0x55b6db0429c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1076 time: 9.48e-07
[0x55b6db2f3980]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 29 time: 9e-08
[0x55b6daf94d50]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 18 time: 9.5e-08
[0x55b6ceefdff0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 6 time: 7.1e-08
[0x55b6e4bce890]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1762 time: 7.17e-07
[0x55b6db03cea0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 0 time: 1.12e-07
[0x55b6dab8a040]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 44 time: 6.91e-07
[0x55b6db07de90]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 4 time: 1.24e-07
[0x55b6db30bc10]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 1 time: 5.5e-08
[0x55b6dab89e50]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 38 time: 9.59e-07
[0x55b6dab8b500]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 41 time: 1.187e-06
[0x55b6dab8a630]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 47 time: 7.69e-07
[0x55b6dab8b360]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 53 time: 6.13e-07
[0x55b6db4774d0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 16 time: 7.9e-08
[0x55b6dab8acb0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 416 time: 2.23e-07
[0x55b6e4bca520]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1441 time: 2.25e-07
[0x55b6dab9ad70]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1753 time: 2.01e-07
[0x55b6c16c4510]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 20 time: 9.9e-08
[0x55b6e4bcdb70]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1765 time: 8.13e-07
[0x55b6db2f28c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 419 time: 1.79e-07
[0x55b6e3d0d790]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 425 time: 8.89e-07
[0x55b6db452960]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 23 time: 8.3e-08
[0x55b6e3d0eb30]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 434 time: 3.45e-07
[0x55b6e3d0edc0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 755 time: 1.42e-07
[0x55b6db0422f0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 764 time: 6.61e-07
[0x55b6e4bccf60]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1759 time: 1.113e-06
[0x55b6db041ca0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 758 time: 1.117e-06
[0x55b6e3d0f140]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 761 time: 6.96e-07
[0x55b6e4bc97a0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1447 time: 1.69e-07
[0x55b6db042410]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 767 time: 6.52e-07
[0x55b6e4bc9a30]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1450 time: 5.76e-07
[0x55b6db03d350]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 2 time: 3.1e-08
[0x55b6db0411f0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1058 time: 1.08e-07
[0x55b6e4bc9bd0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1453 time: 6.98e-07
[0x55b6db041580]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1061 time: 1.76e-07
[0x55b6db041290]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1064 time: 1.61e-07
[0x55b6db458e40]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 19 time: 1.02e-07
[0x55b6e4bbd490]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1073 time: 1.34e-07
[0x55b6e4bc9f90]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1456 time: 7.06e-07
[0x55b6e4bca100]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1459 time: 5.64e-07
[0x55b6dab9b340]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1756 time: 1.67e-07
[0x55b6e4bcdd10]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1768 time: 7.89e-07
[0x55b6db0ea9e0]:4 :: weight scales in internalAllocate: at runtime/common/weightsPtr.cpp: 100 idx: 8 time: 9.3e-08
[0x55b6e3d0ef00]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 166 idx: 1070 time: 1.33e-07
-------------- The current device memory allocations dump as below --------------
[0]:4294967296 :HybridGlobWriter in reserveRegion: at optimizer/common/globWriter.cpp: 246 idx: 6 time: 0.000150233
[0x5021d6200]:18432 :GpuGlob deserialization in load: at runtime/deserialization/safeDeserialize.cpp: 349 idx: 4 time: 1.4569e-05
[0x504200000]:134217728 :HybridGlobWriter in reserveRegion: at optimizer/common/globWriter.cpp: 246 idx: 5 time: 0.000363279
[04/11/2022-09:13:11] [E] [TRT] Requested amount of GPU memory (4294967296 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[04/11/2022-09:13:11] [W] [TRT] Skipping tactic 2 due to oom error on requested size of 4294967296 detected for tactic 2.
Try decreasing the workspace size with IBuilderConfig::setMaxWorkspaceSize().
[04/11/2022-09:13:46] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[04/11/2022-09:13:46] [I] [TRT] Total Host Persistent Memory: 17536
[04/11/2022-09:13:46] [I] [TRT] Total Device Persistent Memory: 51033088
[04/11/2022-09:13:46] [I] [TRT] Total Scratch Memory: 140175360
[04/11/2022-09:13:46] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 5 MiB, GPU 4224 MiB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 753, GPU 881 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 753, GPU 889 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 753, GPU 873 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 752, GPU 857 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 752 MiB, GPU 857 MiB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 800, GPU 805 (MiB)
[04/11/2022-09:13:46] [I] Create ICudaEngine  !
[04/11/2022-09:13:46] [I] [TRT] Loaded engine size: 51 MB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 800 MiB, GPU 805 MiB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 800, GPU 864 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 801, GPU 872 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 800, GPU 856 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 800 MiB, GPU 856 MiB
[04/11/2022-09:13:46] [I] getNbInputs: 1 

[04/11/2022-09:13:46] [I] getNbOutputs: 6 

[04/11/2022-09:13:46] [I] getNbOutputs Name: 246 

[04/11/2022-09:13:46] [I] All has Built !  
[04/11/2022-09:13:46] [I] Creating pfe context 
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 761 MiB, GPU 889 MiB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 761, GPU 897 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 761, GPU 905 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 761 MiB, GPU 1299 MiB
[04/11/2022-09:13:46] [I] Creating rpn context 
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 823 MiB, GPU 1362 MiB
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 823, GPU 1370 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 823, GPU 1378 (MiB)
[04/11/2022-09:13:46] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 823 MiB, GPU 2109 MiB
===========FilePath[0/10]:../../lidars/seq_0_frame_100.bin==============
[04/11/2022-09:13:47] [I] [INFO] pointNum : 177125
Success to read and Point Num  Is: 177125
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
Aborted (core dumped)
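
For what it is worth, `cudaErrorInvalidDeviceFunction` raised from thrust usually means the compiled binary contains no kernel image for the GPU it is running on, so the CUDA architecture used at build time is worth checking. A minimal sketch, assuming the compute capability of the card is known (for example from the deviceQuery sample); `gencode_flag` is a hypothetical helper, not part of this repository, and how the flag reaches nvcc depends on the project's CMake setup:

```python
def gencode_flag(major, minor):
    """Return the nvcc -gencode option for one compute capability."""
    if major < 1 or not 0 <= minor <= 9:
        raise ValueError("unexpected compute capability")
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"

# Example values only: compute capability 6.1 and 8.6.
for cc in [(6, 1), (8, 6)]:
    print(gencode_flag(*cc))
```

If the build only targets architectures other than the one the card reports, adding the matching flag and rebuilding would be the first thing to try.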

Computation Speed on GTX 1080

Hi Abraham,

First of all, thank you for your excellent work.

GPU: GTX 1080
GPU driver: 460.91.03
Cuda: 11.1
tensorRT: 8.0.1

I got this error:
The engine plan file is generated on an incompatible device, expecting compute 6.1 got compute 8.6, please rebuild.

I rebuilt the engine files as described in #3.

code output:
[01/26/2022-08:11:39] [I] Building and running a GPU inference engine for CenterPoint
[01/26/2022-08:11:39] [I] Building pfe engine . . .
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +153, GPU +0, now: CPU 168, GPU 198 (MiB)
[01/26/2022-08:11:39] [I] Create ICudaEngine !
[01/26/2022-08:11:39] [I] [TRT] Loaded engine size: 0 MB
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 168 MiB, GPU 198 MiB
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.3.0
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +183, GPU +74, now: CPU 351, GPU 272 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +179, GPU +70, now: CPU 530, GPU 342 (MiB)
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 529, GPU 326 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 529 MiB, GPU 326 MiB
[01/26/2022-08:11:39] [I] Building rpn engine . . .
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 580, GPU 326 (MiB)
[01/26/2022-08:11:39] [I] Create ICudaEngine !
[01/26/2022-08:11:39] [I] [TRT] Loaded engine size: 51 MB
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 580 MiB, GPU 326 MiB
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.3.0
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 581, GPU 386 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 581, GPU 394 (MiB)
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 581, GPU 378 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 581 MiB, GPU 378 MiB
[01/26/2022-08:11:39] [I] All has Built !
[01/26/2022-08:11:39] [I] Creating pfe context
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 560 MiB, GPU 412 MiB
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.3.0
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 560, GPU 420 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 560, GPU 428 (MiB)
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 560 MiB, GPU 824 MiB
[01/26/2022-08:11:39] [I] Creating rpn context
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 622 MiB, GPU 888 MiB
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.3.0
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 622, GPU 896 (MiB)
[01/26/2022-08:11:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 622, GPU 906 (MiB)
[01/26/2022-08:11:39] [W] [TRT] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.0.5
[01/26/2022-08:11:39] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 622 MiB, GPU 1638 MiB
===========FilePath[0/10]:../../lidars/seq_0_frame_100.bin==============
[01/26/2022-08:11:39] [I] [INFO] pointNum : 177125
Success to read and Point Num Is: 177125
Num boxes before 1315
Num boxes after 500
===========FilePath[1/10]:../../lidars/seq_0_frame_101.bin==============
[01/26/2022-08:11:39] [I] [INFO] pointNum : 175893
Success to read and Point Num Is: 175893
Num boxes before 1372
Num boxes after 500
===========FilePath[2/10]:../../lidars/seq_0_frame_102.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 177130
Success to read and Point Num Is: 177130
Num boxes before 1483
Num boxes after 500
===========FilePath[3/10]:../../lidars/seq_0_frame_103.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 176870
Success to read and Point Num Is: 176870
Num boxes before 1400
Num boxes after 500
===========FilePath[4/10]:../../lidars/seq_0_frame_104.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 174146
Success to read and Point Num Is: 174146
Num boxes before 1503
Num boxes after 500
===========FilePath[5/10]:../../lidars/seq_0_frame_105.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 172962
Success to read and Point Num Is: 172962
Num boxes before 1481
Num boxes after 500
===========FilePath[6/10]:../../lidars/seq_0_frame_106.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 172704
Success to read and Point Num Is: 172704
Num boxes before 1367
Num boxes after 500
===========FilePath[7/10]:../../lidars/seq_0_frame_107.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 171648
Success to read and Point Num Is: 171648
Num boxes before 1429
Num boxes after 500
===========FilePath[8/10]:../../lidars/seq_0_frame_108.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 170759
Success to read and Point Num Is: 170759
Num boxes before 1376
Num boxes after 500
===========FilePath[9/10]:../../lidars/seq_0_frame_109.bin==============
[01/26/2022-08:11:40] [I] [INFO] pointNum : 168130
Success to read and Point Num Is: 168130
Num boxes before 1404
Num boxes after 500
[01/26/2022-08:11:40] [I] Average PreProcess Time: 6.97109 ms
[01/26/2022-08:11:40] [I] Average PfeInfer Time: 30.7645 ms
[01/26/2022-08:11:40] [I] Average ScatterInfer Time: 0.374614 ms
[01/26/2022-08:11:40] [I] Average RpnInfer Time: 54.2483 ms
[01/26/2022-08:11:40] [I] Average PostProcess Time: 7.86936 ms
[01/26/2022-08:11:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 627, GPU 1572 (MiB)
[01/26/2022-08:11:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 565, GPU 810 (MiB)
&&&& PASSED TensorRT.sample_onnx_centerpoint [TensorRT v8001] # ./centerpoint
Free Variables .

I wonder whether these average times are reasonable for a GTX 1080. What do you think?
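
As a rough way to read these numbers, the per-stage averages can be summed into an end-to-end latency and converted to a frame rate. The sketch below uses only the averages printed in the log above and assumes the stages run one after another with no overlap; the variable names are illustrative.

```python
# Average per-stage times in milliseconds, copied from the log above.
stage_ms = {
    "PreProcess": 6.97109,
    "PfeInfer": 30.7645,
    "ScatterInfer": 0.374614,
    "RpnInfer": 54.2483,
    "PostProcess": 7.86936,
}

total_ms = sum(stage_ms.values())          # end-to-end latency per frame
fps = 1000.0 / total_ms                    # frame rate if stages do not overlap
slowest = max(stage_ms, key=stage_ms.get)  # stage with the largest average

print(f"total: {total_ms:.2f} ms, about {fps:.1f} FPS, slowest stage: {slowest}")
```

Under that assumption the pipeline takes roughly 100 ms per frame (about 10 FPS), with RpnInfer accounting for a little over half of it.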

convert model core dumped

Hello,
When converting the models from ONNX to TensorRT engines, pfe converts but fpn does not:
[03/23/2023-15:04:59] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[03/23/2023-15:04:59] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[03/23/2023-15:04:59] [TRT] [W] Check verbose logs for the list of affected weights.
[03/23/2023-15:04:59] [TRT] [W] - 41 weights are affected by this issue: Detected subnormal FP16 values.
[03/23/2023-15:04:59] [TRT] [W] - 21 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
deserialize the engine . . .
[03/23/2023-15:04:59] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
context_rpn <tensorrt.tensorrt.IExecutionContext object at 0x7f77521fa458>
Segmentation fault (core dumped)

Running the C++ TensorRT code, the fpn also core dumps.

Cuda Runtime (context is destroyed)

Thanks for your great work!
I got this error when I tried to generate TRT engine using create_engine.py. I am using cuda 11.3 and TRT 8.0.1.6. Any suggestions?

root@desktop:/home/CenterPoint/tools# python3 create_engine.py --config waymo_centerpoint_pp_two_pfn_stride1_3x.py --pfe_onnx_path ../models/pfe_baseline32000.onnx --rpn_onnx_path ../models/rpn_baseline.onnx --pfe_engine_path pfe.engine --rpn_engine_path rpn.engine;
[TensorRT] WARNING: onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.                                                                          
building pfe trt engine . . .                                                                                                                                                                                                                                 
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
[TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead                                                                                                                                                                                
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
deserialize the engine . . .                                                                                                                                                                                                                                  
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
context_pfe <tensorrt.tensorrt.IExecutionContext object at 0x7fee2f8ba8b0>                                                                                                                                                                                    
[TensorRT] WARNING: The logger passed into createInferBuilder differs from one already provided for an existing builder, runtime, or refitter. TensorRT maintains only a single logger pointer at any given time, so the existing value, which can be retrieved with getLogger(), will be used instead. In order to use a new logger, first destroy all existing builder, runner or refitter objects.
                                                                                                                                                                                                                                                              
[TensorRT] WARNING: Tensor DataType is determined at build time for tensors not marked as input or output.                                                                                                                                                    
building rpn trt engine . . .                                                                                                                                                                                                                                 
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
[TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead                                                                                                                                                                                
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
deserialize the engine . . .                                                                                                                                                                                                                                  
[TensorRT] WARNING: The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. TensorRT maintains only a single logger pointer at any given time, so the existing value, which can be retrieved with getLogger(), will be used instead. In order to use a new logger, first destroy all existing builder, runner or refitter objects.
                                                                                                                                                                                                                                                              
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 11.4.2                                                                                                                                                    
[TensorRT] WARNING: TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0                                                                                                                                                                            
context_rpn <tensorrt.tensorrt.IExecutionContext object at 0x7fee2f8baa70>                                                                                                                                                                                    
[TensorRT] ERROR: 1: [hardwareContext.cpp::terminateCommonContext::141] Error Code 1: Cuda Runtime (context is destroyed)                                                                                                                                     
Segmentation fault (core dumped)                                                                                      

Test with ouster lidar

I would like to test the model on Ouster lidar data. How should I set the following parameters? What is the difference between X_MIN and X_CENTER_MIN (and the other *_MIN / *_CENTER_MIN pairs), and what is X_STEP?

// pillar size 
#define X_STEP 0.32f
#define Y_STEP 0.32f
#define X_MIN -74.88f
#define X_MAX 74.88f
#define Y_MIN -74.88f
#define Y_MAX 74.88f
#define Z_MIN -2.0f
#define Z_MAX 4.0f

#define X_CENTER_MIN -80.0f
#define X_CENTER_MAX 80.0f
#define Y_CENTER_MIN -80.0f
#define Y_CENTER_MAX 80.0f
#define Z_CENTER_MIN -10.0f
#define Z_CENTER_MAX 10.0f

#define PI 3.141592653f
// parameters for preprocess
#define BEV_W 468
#define BEV_H 468
#define MAX_PILLARS 32000 //20000 //32000
#define MAX_PIONT_IN_PILLARS 20
#define FEATURE_NUM 10
#define PFE_OUTPUT_DIM 64
#define THREAD_NUM 4
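
One relationship among these macros can be checked directly from the values quoted above: the BEV grid size equals the detection range divided by the pillar size. A minimal sketch, assuming X_MIN/X_MAX and Y_MIN/Y_MAX bound the voxelized range and X_STEP/Y_STEP are the pillar edge lengths (as the `// pillar size` comment suggests); `bev_size` is a hypothetical helper, not a function from the repository:

```python
def bev_size(v_min, v_max, step):
    """Number of pillars along one axis for the given range and pillar size."""
    return round((v_max - v_min) / step)

# Values quoted above.
X_MIN, X_MAX, X_STEP = -74.88, 74.88, 0.32
Y_MIN, Y_MAX, Y_STEP = -74.88, 74.88, 0.32

bev_w = bev_size(X_MIN, X_MAX, X_STEP)
bev_h = bev_size(Y_MIN, Y_MAX, Y_STEP)
print(bev_w, bev_h)  # 468 468, matching BEV_W and BEV_H
```

If the range or step is changed for another sensor, BEV_W and BEV_H would presumably have to change with them, and a model trained on a different grid would likely need to be re-exported.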

File Size Error! 1

pointNum : {51916}[01/18/2022-11:34:29] [E] [Error] File Size Error! 1038336
[01/18/2022-11:34:29] [I] [INFO] pointNum : 51916
Success to read and Point Num Is: 51916
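
The numbers in this log are consistent with a reader that expects 5 float32 values (20 bytes) per point: 1038336 // 20 = 51916 with 16 bytes left over, while 1038336 is an exact multiple of 16 bytes (4 values per point). A minimal sketch for checking a `.bin` file before running inference, assuming that layout; the 5-value stride is inferred from the log rather than confirmed from the source code, and both helpers are hypothetical:

```python
import os

def points_and_leftover(size_bytes, floats_per_point=5):
    """Split a file size into whole float32 points and leftover bytes."""
    stride = floats_per_point * 4  # 4 bytes per float32 value
    return divmod(size_bytes, stride)

def check_point_file(path, floats_per_point=5):
    """Apply the same check to a file on disk."""
    return points_and_leftover(os.path.getsize(path), floats_per_point)

# File size taken from the log above.
print(points_and_leftover(1038336, 5))  # (51916, 16): not a whole number of points
print(points_and_leftover(1038336, 4))  # (64896, 0): fits 4 values per point
```

If the data really stores 4 values per point, either padding each point with a fifth value or adjusting the reader might get past the check; which is appropriate depends on the features the model was trained with.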

error: 'class nvinfer1::IBuilder' has no member named 'buildSerializedNetwork'

I want to run TensorRT engine inference using your open-source code (https://github.com/Abraham423/CenterPoint), but I cannot get it to build.

When I run cmake .. && make, I always get the following errors. Could you provide a prebuilt Docker image, or explain how to fix this?

src/centerpoint.cpp:134:48: error: 'class nvinfer1::IBuilder' has no member named 'buildSerializedNetwork'
  134 | SampleUniquePtr plan{builder->buildSerializedNetwork(*network, *config)};
      |                                                ^~~~~~~~~~~~~~~~~~~~~~
/data/CUDA-PointPillars-main/CenterPoint_int8/src/centerpoint.cpp:134:89: error: no matching function for call to 'std::unique_ptr<nvinfer1::IHostMemory, samplesCommon::InferDeleter>::unique_ptr()'
  134 | SampleUniquePtr plan{builder->buildSerializedNetwork(*network, *config)};
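
`IBuilder::buildSerializedNetwork` was added in TensorRT 8.0, so this compile error usually indicates that the build is picking up headers from an older TensorRT. A minimal sketch for checking which version a header belongs to, assuming the installation ships `NvInferVersion.h` with the usual `NV_TENSORRT_MAJOR`/`MINOR`/`PATCH` macros; the header path in the comment is a placeholder, the sample text is made up for illustration, and `trt_header_version` is a hypothetical helper:

```python
import re

def trt_header_version(header_text):
    """Extract (major, minor, patch) from the text of NvInferVersion.h."""
    parts = []
    for name in ("MAJOR", "MINOR", "PATCH"):
        m = re.search(rf"#define\s+NV_TENSORRT_{name}\s+(\d+)", header_text)
        if m is None:
            raise ValueError(f"NV_TENSORRT_{name} not found")
        parts.append(int(m.group(1)))
    return tuple(parts)

# Made-up sample text; in practice read the header the build actually uses,
# e.g. open("/path/to/TensorRT/include/NvInferVersion.h").read()
sample = """
#define NV_TENSORRT_MAJOR 7
#define NV_TENSORRT_MINOR 2
#define NV_TENSORRT_PATCH 3
"""
version = trt_header_version(sample)
print(version, "has buildSerializedNetwork:", version >= (8, 0, 0))
```

If the header reports a version below 8.0, pointing CMake at a TensorRT 8 installation (the README lists 8.0.1.6) would be the likely fix.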
