
tpat's Introduction

TPAT - TensorRT Plugin Autogen Tool

Introduction

  1. Automatically generate high-performance TensorRT plugins for unsupported operators, or to replace inefficient kernels.
  2. End-to-end command line tool. No CUDA programming knowledge is required: users only need to provide the ONNX model and assign the node names or types to auto-generate a TensorRT plugin.
  3. Auto-generated TensorRT plugins achieve high performance in real cases.

Support Matrix

Runtime Env : dockerfile

1. Build image

nvidia-docker build .

2. Run container

nvidia-docker run -itd --gpus all -v <TPAT path dir>:/root <Image_ID> /bin/bash

3. Enter the container

nvidia-docker exec -it <Container_ID> /bin/bash

4. Modify CUDA_PATH and TRT_LIB_PATH in python/trt_plugin/Makefile

CUDA_PATH: local CUDA installation path
TRT_LIB_PATH: local TensorRT installation path
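
For example, assuming CUDA is installed under /usr/local/cuda and TensorRT was unpacked to /opt/TensorRT (both paths are illustrative, not defaults), the two Makefile variables would be set as:

CUDA_PATH = /usr/local/cuda
TRT_LIB_PATH = /opt/TensorRT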

5. Auto-generate the plugin

cd examples
python test_onehot_dynamic_direct.py
  • tpat_onehot.so is stored in python/trt_plugin/lib/

Runtime Env : Build

1. Prerequisites

System Packages

  • LLVM >= 9.0.1, (LLVM==9.0.1 recommended)
  • GCC >= 7.3.0, (GCC==7.4.0 recommended)
  • TensorRT

PyPI packages

  • numpy pycuda onnx onnxruntime onnx_graphsurgeon xgboost jinja2 ctypes tornado cloudpickle psutil

NOTE: these required packages are listed in requirements.txt

Optional packages

  • tensorflow-gpu==1.15
  • tf2onnx
  • torch
  • pytest

NOTE: these optional packages are required by the examples and unit tests

2. Clone the TPAT repository

git clone -b master https://github.com/nvidia/TensorRT TPAT
cd TPAT
git submodule update --init --recursive

3. Build BlazerML-TVM

mkdir build && cp cmake/config.cmake build
#Edit build/config.cmake to customize the compilation options
set(USE_LLVM /usr/local/llvm/bin/llvm-config)
set(USE_CUDA ON)
#gcc compiler is required to support C++14
cd build && cmake .. 
make -j
#TVM Python package
export TVM_HOME=/path/to/tvm
export PYTHONPATH=$TVM_HOME/python:${PYTHONPATH}
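
A quick sanity check (a minimal sketch, assuming the build above succeeded and PYTHONPATH is set as shown) to confirm the TVM Python package is importable and CUDA support is enabled:

import tvm

print(tvm.__version__)               # version of the BlazerML-TVM checkout
print(tvm.runtime.enabled("cuda"))   # True if built with set(USE_CUDA ON)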

4. Plugin Compiler Env

Modify python/trt_plugin/Makefile according to your environment setup.

CUDA_PATH: local CUDA installation path
TRT_LIB_PATH: local TensorRT installation path

Usage

TPAT can be used via a Python function or a command line tool.

Python function

onnx2plugin(
	input_model_path, 
	output_model_path, 
	node_names=None, 
	node_types=None, 
	plugin_name_dict=None,
	dynamic_bs=False, # if True, the generated plugin supports dynamic batch size
	min_bs=1,
	max_bs=256,
	opt_bs=128
	)
  • input_model_path[required] : input ONNX model containing the nodes that require a TRT plugin
  • output_model_path[required] : output ONNX model in which the corresponding node types are replaced by plugin names. The output model can be converted to TRT directly with the ONNX parser and the built plugin dynamic library.
  • node_names : list of node names for autogen
  • node_types : list of node types for autogen
  • plugin_name_dict : dict of {node_name: plugin_name} for autogen (see the -p example below)
  • dynamic_bs : if True, TPAT generates a plugin that supports dynamic batch size; if False, the generated plugin only supports fixed shapes but has better performance
  • min_bs : the minimum batch size in the dynamic batch range
  • max_bs : the maximum batch size in the dynamic batch range
  • opt_bs : the batch size to optimize for within the dynamic batch range

NOTE: at least one of node_names, node_types, or plugin_name_dict must be provided
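
For example, a minimal sketch (the model path and node name below are illustrative placeholders; it assumes the script runs where python/onnx_to_plugin.py is importable):

from onnx_to_plugin import onnx2plugin

# Replace the hypothetical node "OneHot_0" in model.onnx with an
# auto-generated TRT plugin; the compiled .so is written to trt_plugin/lib/.
trt_plugin_names = onnx2plugin(
    input_model_path="model.onnx",
    output_model_path="model_with_plugin.onnx",
    node_names=["OneHot_0"],
)
print(trt_plugin_names)  # e.g. ["tpat_OneHot_0"]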

Command line

# Separate different ops with spaces
python3 onnx_to_plugin.py -i input.onnx -o output.onnx -n op_name1 op_name2 -dynamic=true -min=1 -max=512 -opt=256
python3 onnx_to_plugin.py -i input.onnx -o output.onnx -t op_type1 op_type2 -dynamic=false
python3 onnx_to_plugin.py -i input.onnx -o output.onnx -p '{"op_name1": "plugin_name1", "op_name2": "plugin_name2"}'
  • -i[required]: input_model_path
  • -o[required]: output_model_path
  • -n: node_names
  • -t: node_types
  • -p: plugin_name_dict
  • -dynamic: dynamic_bs
  • -min: min_bs
  • -max: max_bs
  • -opt: opt_bs

Output

1. Assign nodes and plugin names through plugin_name_dict

  • trt_plugin/src contains {plugin_name}.cu and {plugin_name}.h
  • trt_plugin/lib contains {plugin_name}.so

2. Assign node names or node types

  • trt_plugin/src contains tpat_{node_name}.cu and tpat_{node_name}.h
  • trt_plugin/lib contains tpat_{node_name}.so
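
As a sketch of consuming these outputs from Python (file paths are illustrative; the TensorRT calls are standard Python-API usage, though details may vary across TensorRT versions), the plugin library must be loaded before the output ONNX model is parsed:

import ctypes
import tensorrt as trt

# Loading the .so lets the plugin creator self-register; then expose
# all registered creators to the ONNX parser.
ctypes.CDLL("trt_plugin/lib/tpat_test_onehot.so")  # illustrative path
logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, "")

builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("output.onnx", "rb") as f:  # model produced by TPAT
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))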

Example && UnitTest

Release notes

Changelog

  • Support multiple nodes for autogen
  • Support boolean inputs/outputs
  • Able to reuse plugins

Known issues

  • Dynamic shapes are supported only for the batch dimension
  • Operators with int8/float16/double inputs/outputs are not supported

TODO

  • Support ONNX subgraph for autogen
  • Support direct conversion from TensorFlow and PyTorch

tpat's People

Contributors

  • buptqq


tpat's Issues

Fail to run example test_onehot_dynamic_direct.py

Description

I tried to run the example test_onehot_dynamic_direct.py, but got a segmentation fault. I found that the fault occurs in parser.parse(model.read()) (line 268). I would appreciate it if you could help me solve this problem.

Environment

docker-image==nvcr.io/nvidia/tensorflow:20.06-tf1-py3
nvidia-driver==470.82
cuda==11.3
TensorRT==8.2.3

onnx==1.10.0
onnxruntime==1.10.0
onnxruntime-gpu==1.10.0
onnx-graphsurgeon==0.3.26
tf2onnx==1.11.1

Log

[02/28/2023-11:46:17] [TRT] [V] Original shape: (_, 64), unsqueezing to: (_, _, _, _)
[02/28/2023-11:46:17] [TRT] [W] ShapedWeights.cpp:173: Weights dense/kernel/read:0 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[02/28/2023-11:46:17] [TRT] [V] Registering layer: dense/MatMul for ONNX node: dense/MatMul
[02/28/2023-11:46:17] [TRT] [V] Original shape: (_, 256, 1, 1), squeezing to: (_, _)
[02/28/2023-11:46:17] [TRT] [V] Registering tensor: dense/MatMul:0 for ONNX tensor: dense/MatMul:0
[02/28/2023-11:46:17] [TRT] [V] dense/MatMul [MatMul] outputs: [dense/MatMul:0 -> (-1, 256)[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Parsing node: Min__6 [Min]
[02/28/2023-11:46:17] [TRT] [V] Searching for input: dense/MatMul:0
[02/28/2023-11:46:17] [TRT] [V] Searching for input: clip_by_value/Minimum/y:0
[02/28/2023-11:46:17] [TRT] [V] Min__6 [Min] inputs: [dense/MatMul:0 -> (-1, 256)[FLOAT]], [clip_by_value/Minimum/y:0 -> ()[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Registering layer: clip_by_value/Minimum/y:0 for ONNX node: clip_by_value/Minimum/y:0
[02/28/2023-11:46:17] [TRT] [V] Registering layer: Min__6 for ONNX node: Min__6
[02/28/2023-11:46:17] [TRT] [V] Registering tensor: Min__6:0 for ONNX tensor: Min__6:0
[02/28/2023-11:46:17] [TRT] [V] Min__6 [Min] outputs: [Min__6:0 -> (-1, 256)[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Parsing node: Max__9 [Max]
[02/28/2023-11:46:17] [TRT] [V] Searching for input: Min__6:0
[02/28/2023-11:46:17] [TRT] [V] Searching for input: clip_by_value/y:0
[02/28/2023-11:46:17] [TRT] [V] Max__9 [Max] inputs: [Min__6:0 -> (-1, 256)[FLOAT]], [clip_by_value/y:0 -> ()[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Registering layer: clip_by_value/y:0 for ONNX node: clip_by_value/y:0
[02/28/2023-11:46:17] [TRT] [V] Registering layer: Max__9 for ONNX node: Max__9
[02/28/2023-11:46:17] [TRT] [V] Registering tensor: Max__9:0 for ONNX tensor: Max__9:0
[02/28/2023-11:46:17] [TRT] [V] Max__9 [Max] outputs: [Max__9:0 -> (-1, 256)[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Parsing node: Cast [Cast]
[02/28/2023-11:46:17] [TRT] [V] Searching for input: Max__9:0
[02/28/2023-11:46:17] [TRT] [V] Cast [Cast] inputs: [Max__9:0 -> (-1, 256)[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [V] Casting to type: int32
[02/28/2023-11:46:17] [TRT] [V] Registering layer: Cast for ONNX node: Cast
[02/28/2023-11:46:17] [TRT] [V] Registering tensor: Cast:0 for ONNX tensor: Cast:0
[02/28/2023-11:46:17] [TRT] [V] Cast [Cast] outputs: [Cast:0 -> (-1, 256)[INT32]], 
[02/28/2023-11:46:17] [TRT] [V] Parsing node: test_onehot [tpat_test_onehot]
[02/28/2023-11:46:17] [TRT] [V] Searching for input: Cast:0
[02/28/2023-11:46:17] [TRT] [V] Searching for input: const_fold_opt__17
[02/28/2023-11:46:17] [TRT] [V] Searching for input: const_fold_opt__19
[02/28/2023-11:46:17] [TRT] [V] test_onehot [tpat_test_onehot] inputs: [Cast:0 -> (-1, 256)[INT32]], [const_fold_opt__17 -> (1)[INT32]], [const_fold_opt__19 -> (2)[FLOAT]], 
[02/28/2023-11:46:17] [TRT] [I] No importer registered for op: tpat_test_onehot. Attempting to import as plugin.
[02/28/2023-11:46:17] [TRT] [I] Searching for plugin: tpat_test_onehot, plugin_version: 1, plugin_namespace: 
[02/28/2023-11:46:17] [TRT] [V] Registering layer: const_fold_opt__17 for ONNX node: const_fold_opt__17
[02/28/2023-11:46:17] [TRT] [V] Registering layer: const_fold_opt__19 for ONNX node: const_fold_opt__19
[02/28/2023-11:46:17] [TRT] [I] Successfully created plugin: tpat_test_onehot
[02/28/2023-11:46:17] [TRT] [V] Registering layer: test_onehot for ONNX node: test_onehot
Segmentation fault

KeyError int8

Model class

import torch
from typing import Optional
from transformers import GPT2LMHeadModel
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions

class CustomModel(GPT2LMHeadModel):
    def __init__(self, config):
        super(CustomModel, self).__init__(config)
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(
        self,
        input_ids: Optional[torch.IntTensor] = None
    ) -> torch.FloatTensor:
        
        transformer_outputs = self.transformer(
            input_ids,
            past_key_values=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
        )
        hidden_states = transformer_outputs[0]

        lm_logits = self.lm_head(hidden_states)
        labels = input_ids
        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        loss = self.loss(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))        

        # return loss.reshape(-1,1)
        return CausalLMOutputWithCrossAttentions(
            loss=loss.reshape(-1,1),
            logits=None,
            past_key_values=None,
            hidden_states=None,
            attentions=None,
            cross_attentions=None,
        )

Command for conversion:

onnx2plugin(
	input_model_path ="./onnx_tpat/model.onnx", 
	output_model_path="./onnx_tpat/model.tpat.onnx", 
	# node_names="/loss/SoftmaxCrossEntropyLoss", 
	node_types = ["SoftmaxCrossEntropyLoss"], 
	plugin_name_dict={"SoftmaxCrossEntropyLoss": "tpat_softmax_cross_entropy"},
    dynamic_bs=False,
	# dynamic_bs=True, # if True, this operator support dynamic batchsize
	# min_bs=1,
	# opt_bs=64,
	# max_bs=100,
	)

I faced this error:

Couldn't find reusable plugin for node /loss/SoftmaxCrossEntropyLoss
Start auto-tuning!
Compile...
/tmp/tuning.log does not exist!


Running...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[2], line 1
----> 1 onnx2plugin(
      2 	input_model_path ="./onnx_tpat/model.onnx", 
      3 	output_model_path="./onnx_tpat/model.tpat.onnx", 
      4 	# node_names="/loss/SoftmaxCrossEntropyLoss", 
      5 	node_types = ["SoftmaxCrossEntropyLoss"], 
      6 	plugin_name_dict={"SoftmaxCrossEntropyLoss": "tpat_softmax_cross_entropy"},
      7     dynamic_bs=False,
      8 	# dynamic_bs=True, # if True, this operator support dynamic batchsize
      9 	# min_bs=1,
     10 	# opt_bs=64,
     11 	# max_bs=100,
     12 	)

File /workspace/TPAT/python/onnx_to_plugin.py:196, in onnx2plugin(input_model_path, output_model_path, node_names, node_types, plugin_name_dict, dynamic_bs, min_bs, max_bs, opt_bs)
    194         os.remove(dy_input_model)
    195 else:
--> 196     onnx_name_mapping_trt_plugin = generate_plugin_library(
    197         input_model_path, nodes, plugin_name_dict 
    198     )
    199 print("Onnx_name_mapping_trt_plugin: {}".format(onnx_name_mapping_trt_plugin))
    200 OnnxModified(
    201     input_model_path, output_model_path, nodes, onnx_name_mapping_trt_plugin
...
    352             )
    353     input_slot_dict[idx] = self._input_dict[str(i)]
    354 if len(self._allocate_global_memory) != 0:

KeyError: 'int8'

RandomNormal not supported for frontend ONNX

I want to create a RandomNormal op, and I found this operator marked "Y" in the TPAT-1.0 Operator Schemas. I built BlazerML-TVM successfully and used the command line python onnx_to_plugin.py -i randn_test.onnx -o output.onnx -t RandomNormal, but there was an error: tvm.error.OpNotImplemented: The following operators are not supported for frontend ONNX: RandomNormal. In /mypath/TPAT/3rdparty/blazerml-tvm/python/tvm/relay/frontend/onnx.py, I found the RandomNormal operator commented out in the function _get_convert_map, which I think means RandomNormal is not supported.
So how does TPAT-1.0 create a .so file for RandomNormal?

Error when running one_hot example

2023-03-06 03:27:42,218 - INFO - tf2onnx: ONNX model is saved at model/test_op_plugin.onnx
const_input:  Constant (const_fold_opt__17): (shape=(1,), dtype=<class 'numpy.int32'>)
values:  [256]
const_input:  Constant (const_fold_opt__19): (shape=(2,), dtype=<class 'numpy.float32'>)
values:  [0. 1.]
/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:53: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn("Specified provider '{}' is not in available provider names."
Compile...
/tmp/tuning.log does not exist!
Running...
/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:53: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'CPUExecutionProvider'
  warnings.warn("Specified provider '{}' is not in available provider names."
Traceback (most recent call last):
  File "test_onehot_dynamic_direct.py", line 335, in <module>
    main()
  File "test_onehot_dynamic_direct.py", line 229, in main
    trt_plugin_names = onnx2plugin(
  File "/root/examples/../python/onnx_to_plugin.py", line 190, in onnx2plugin
    onnx_name_mapping_trt_plugin = generate_plugin_library(
  File "/root/examples/../python/onnx_to_plugin.py", line 86, in generate_plugin_library
    template_params_list.append(PluginTemplateParams(
  File "/root/python/plugin_template_params.py", line 64, in __init__
    self.parse()
  File "/root/python/plugin_template_params.py", line 163, in parse
    constant_params = self._kernel_generate.constant_param
  File "/root/python/cuda_kernel.py", line 287, in constant_param
    return self._lib.get_constant_params()
AttributeError: 'GraphExecutorFactoryModule' object has no attribute 'get_constant_params'

test_tpat.py error

Traceback (most recent call last):
  File "test_tpat.py", line 3860, in <module>
    test_abs()
  File "test_tpat.py", line 360, in test_abs
    op_expect(node, inputs=[x], outputs=[y], op_type=op_type, op_name=op_name)
  File "test_tpat.py", line 346, in op_expect
    verify_with_ort_with_trt(model, inputs, op_name, np_result=np_result)
  File "test_tpat.py", line 251, in verify_with_ort_with_trt
    ort_result = get_onnxruntime_output(model, inputs)
  File "test_tpat.py", line 225, in get_onnxruntime_output
    rep = onnxruntime.backend.prepare(model, "CPU")
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/backend/backend.py", line 138, in prepare
    return cls.prepare(bin, device, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/backend/backend.py", line 114, in prepare
    inf = InferenceSession(model, sess_options=options, providers=providers)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.6/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 370, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Failed to load model with error: /onnxruntime_src/onnxruntime/core/graph/model_load_utils.h:47 void onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unordered_map<std::basic_string, int>&, const onnxruntime::logging::Logger&, bool, const string&, int) ONNX Runtime only guarantees support for models stamped with official released onnx opset versions. Opset 16 is under development and support for this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Runtime will not guarantee backward compatibility. Current official support for domain ai.onnx is till opset 15.

TPAT and TRT - no kernel image is available for execution on the device

Hi,

I've successfully converted a model to TensorRT using a TPAT-generated plugin with the following command:

/usr/src/tensorrt/bin/trtexec --onnx=model_batch1_tpat.onnx --saveEngine=model.plan --buildOnly --verbose --fp16 --workspace=6000 --explicitBatch --noTF32 --plugins="tpat_onehot.so"

but after running trtexec test using this command:

/usr/src/tensorrt/bin/trtexec  --loadEngine=model.plan --verbose --workspace=6000  --plugins="./tpat_onehot.so"

I'm getting the following errors:

[06/02/2022-02:55:34] [E] [TRT] ../rtExt/cuda/cudaPluginV2DynamicExtRunner.cpp (108) - Cuda Error in execute: 209 (no kernel image is available for execution on the device)
[06/02/2022-02:55:34] [E] [TRT] FAILED_EXECUTION: std::exception

I managed to get one TPAT plugin for tpat_onehot.so which doesn't throw this error, but I don't see any difference in the way I generated the plugins. Is there something about the non-deterministic process of generating a plugin using TVM that can cause this behavior?

Thank you!

Could you provide a simple tutorial on how to run onnx_to_plugin for a simple operator?

Hi, thank you for your great work.
I just wonder how to run onnx_to_plugin on the Tile operator. I know it is supported by TPAT 1.0.
I have tried
python3 onnx_to_plugin.py -i model/pfe_baseline32000.onnx -o model/pfe_baseline_tpat.onnx -t Tile
python3 onnx_to_plugin.py -i model/pfe_baseline32000.onnx -o model/pfe_baseline_tpat.onnx -n Tile_16 -dynamic=true -min=1 -max=256 -opt=128
But it returns

Couldn't find reusable plugin for node Tile_16

  7: tvm::relay::StorageAllocaBaseVisitor::DeviceAwareVisitExpr_(tvm::relay::FunctionNode const*)                                  [0/60]
  6: tvm::relay::StorageAllocaBaseVisitor::GetToken(tvm::RelayExpr const&)
  5: tvm::relay::ExprVisitor::VisitExpr(tvm::RelayExpr const&)
  4: tvm::relay::transform::DeviceAwareExprVisitor::VisitExpr_(tvm::relay::CallNode const*)
  3: tvm::relay::StorageAllocator::DeviceAwareVisitExpr_(tvm::relay::CallNode const*)
  2: tvm::relay::StorageAllocaBaseVisitor::CreateToken(tvm::RelayExprNode const*, bool)
  1: tvm::relay::StorageAllocator::CreateTokenOnDevice(tvm::RelayExprNode const*, DLDeviceType, bool)
  0: tvm::relay::StorageAllocator::GetMemorySize(tvm::relay::StorageToken*)
  File "/workspace/TPAT/3rdparty/blazerml-tvm/src/relay/backend/graph_plan_memory.cc", line 408
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (pval != nullptr) is false: Cannot allocate memory symbolic tensor shape [?, ?, ?]

Thank you

TensorFlow BERT model cannot be built successfully; when will this be solved?

A TensorFlow BERT model (ckpt, pb, or saved_model) can't be built successfully by TPAT. The problem is that the input & output nodes of the ONNX model converted by tf2onnx have no shape or dtype info (all are None). My current workaround is to use onnxruntime to generate another ONNX model that has shape and dtype info, and then use TPAT to generate the TensorRT model. So, when do you plan to solve this problem?

test_tpat error

I have built this project with the required gcc==7.3.0 and LLVM 9.0.1.
My onnxruntime==1.9.0, onnx==1.10.0.

When I run test_tpat.py, I get the following error:

Traceback (most recent call last):
  File "test_tpat.py", line 3908, in <module>
    test_abs()
  .............
  File "../python/onnx_to_plugin.py", line 98, in onnx2plugin
    input_model_path, nodes, plugin_name_dict
  File "../python/onnx_to_plugin.py", line 43, in generate_plugin_library
    cuda_kernel.run()
  File "../python/cuda_kernel.py", line 69, in run
    mod, params, self._target, include_simple_tasks=True, opt_level=op_level
TypeError: autoscheduler_get_tunning_tasks() got an unexpected keyword argument 'opt_level'

Can anyone help?
Thank you!

.so build succeeds, but TensorRT fails to run the one-hot example

Running in the Docker container created from the Dockerfile, I get:

[TensorRT] ERROR: INVALID_ARGUMENT: getPluginCreator could not find plugin tpat_test_onehot version 1
In node -1 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[TensorRT] ERROR: Network must have at least one output
[TensorRT] ERROR: Network validation failed.
[ERROR] engine is None

It seems the plugin is not loaded properly.

How do I fix it?

See the full log below:

root@0390133f0efa:~/examples# python test_onehot_dynamic_direct.py
2023-10-08 08:49:03.568139: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-10-08 08:49:05.214999: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2023-10-08 08:49:05.215383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x678b5d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-08 08:49:05.215399: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-10-08 08:49:05.216549: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2023-10-08 08:49:05.273763: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.273974: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x678d2f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-08 08:49:05.273991: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2070, Compute Capability 7.5
2023-10-08 08:49:05.274107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.274219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
2023-10-08 08:49:05.274242: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:05.274250: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:05.274276: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2023-10-08 08:49:05.274284: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2023-10-08 08:49:05.276174: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2023-10-08 08:49:05.276622: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2023-10-08 08:49:05.276637: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2023-10-08 08:49:05.276687: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.276832: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.276916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2023-10-08 08:49:05.276937: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:05.522333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-08 08:49:05.522360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2023-10-08 08:49:05.522366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2023-10-08 08:49:05.522522: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.522698: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:05.522807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7031 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2023-10-08 08:49:05.554951: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:06.211052: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/lib/python3.6/runpy.py:125: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tf2onnx/verbose_logging.py:76: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

2023-10-08 08:49:07,330 - WARNING - tensorflow: From /usr/local/lib/python3.6/dist-packages/tf2onnx/verbose_logging.py:76: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

2023-10-08 08:49:07.331626: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2023-10-08 08:49:07.357287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.357450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
2023-10-08 08:49:07.357467: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:07.359028: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:07.359702: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2023-10-08 08:49:07.359943: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2023-10-08 08:49:07.361533: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2023-10-08 08:49:07.361962: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2023-10-08 08:49:07.362148: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2023-10-08 08:49:07.362249: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.362408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.362507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2023-10-08 08:49:07.391000: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2023-10-08 08:49:07.391315: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47e4fe0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-08 08:49:07.391330: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-10-08 08:49:07.439144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.439349: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x48208d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-08 08:49:07.439364: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2070, Compute Capability 7.5
2023-10-08 08:49:07.439512: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.439621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
2023-10-08 08:49:07.439641: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:07.439660: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:07.439672: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2023-10-08 08:49:07.439683: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2023-10-08 08:49:07.439703: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2023-10-08 08:49:07.439715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2023-10-08 08:49:07.439726: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2023-10-08 08:49:07.439772: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.439892: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.439976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2023-10-08 08:49:07.440000: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:07.684408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-08 08:49:07.684438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2023-10-08 08:49:07.684444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2023-10-08 08:49:07.684700: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.684903: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.685038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6648 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tf2onnx/tf_loader.py:343: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

2023-10-08 08:49:07,685 - WARNING - tensorflow: From /usr/local/lib/python3.6/dist-packages/tf2onnx/tf_loader.py:343: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

INFO:tensorflow:Froze 0 variables.
2023-10-08 08:49:07,689 - INFO - tensorflow: Froze 0 variables.
INFO:tensorflow:Converted 0 variables to const ops.
2023-10-08 08:49:07,690 - INFO - tensorflow: Converted 0 variables to const ops.
2023-10-08 08:49:07.690924: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.691090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
2023-10-08 08:49:07.691111: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:07.691127: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:07.691137: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2023-10-08 08:49:07.691147: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2023-10-08 08:49:07.691169: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2023-10-08 08:49:07.691179: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2023-10-08 08:49:07.691190: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2023-10-08 08:49:07.691236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.691356: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.691453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2023-10-08 08:49:07.691471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-08 08:49:07.691477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2023-10-08 08:49:07.691482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2023-10-08 08:49:07.691540: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.691668: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.691763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6648 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2023-10-08 08:49:07.692623: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.692728: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2023-10-08 08:49:07.692805: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2023-10-08 08:49:07.693102: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.693195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: NVIDIA GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
2023-10-08 08:49:07.693209: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2023-10-08 08:49:07.693221: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2023-10-08 08:49:07.693230: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2023-10-08 08:49:07.693240: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2023-10-08 08:49:07.693250: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2023-10-08 08:49:07.693260: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2023-10-08 08:49:07.693276: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2023-10-08 08:49:07.693341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.693477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.693565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2023-10-08 08:49:07.693579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-10-08 08:49:07.693585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2023-10-08 08:49:07.693590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2023-10-08 08:49:07.693649: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.693774: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-08 08:49:07.693868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6648 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2023-10-08 08:49:07.696080: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:822] Optimization results for grappler item: graph_to_optimize
2023-10-08 08:49:07.696093: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:824]   constant_folding: Graph size after: 15 nodes (-2), 14 edges (-2), time = 0.933ms.
2023-10-08 08:49:07.696097: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:824]   function_optimizer: function_optimizer did nothing. time = 0.009ms.
2023-10-08 08:49:07.696101: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:824]   constant_folding: Graph size after: 15 nodes (0), 14 edges (0), time = 0.241ms.
2023-10-08 08:49:07.696104: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:824]   function_optimizer: function_optimizer did nothing. time = 0.007ms.
2023-10-08 08:49:07,696 - INFO - tf2onnx: inputs: ['input:0']
2023-10-08 08:49:07,696 - INFO - tf2onnx: outputs: ['output:0']
2023-10-08 08:49:07,698 - INFO - tf2onnx.tfonnx: Using tensorflow=1.15.2, onnx=1.10.0, tf2onnx=1.11.1/1915fb
2023-10-08 08:49:07,699 - INFO - tf2onnx.tfonnx: Using opset <onnx, 11>
2023-10-08 08:49:07,708 - INFO - tf2onnx.tf_utils: Computed 0 values for constant folding
2023-10-08 08:49:07,717 - VERBOSE - tf2onnx.tfonnx: Mapping TF node to ONNX node(s)
2023-10-08 08:49:07,719 - VERBOSE - tf2onnx.tfonnx: Summay Stats:
	tensorflow ops: Counter({'Const': 7, 'Identity': 3, 'Placeholder': 1, 'MatMul': 1, 'Minimum': 1, 'Maximum': 1, 'Cast': 1, 'OneHot': 1})
	tensorflow attr: Counter({'dtype': 8, 'value': 7, 'shape': 1, 'transpose_a': 1, 'transpose_b': 1, 'Truncate': 1, 'to': 1, 'axis': 1})
	onnx mapped: Counter({'Const': 6, 'Identity': 2, 'Placeholder': 1, 'MatMul': 1, 'Minimum': 1, 'Maximum': 1, 'Cast': 1, 'OneHot': 1})
	onnx unmapped: Counter()
2023-10-08 08:49:07,719 - INFO - tf2onnx.optimizer: Optimizing ONNX model
2023-10-08 08:49:07,719 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2023-10-08 08:49:07,722 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2023-10-08 08:49:07,722 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2023-10-08 08:49:07,724 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2023-10-08 08:49:07,724 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2023-10-08 08:49:07,726 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: Concat -1 (1->0), Const -1 (6->5), Unsqueeze -3 (3->0)
2023-10-08 08:49:07,726 - VERBOSE - tf2onnx.optimizer: Apply const_dequantize_optimizer
2023-10-08 08:49:07,727 - VERBOSE - tf2onnx.optimizer.ConstDequantizeOptimizer: no change
2023-10-08 08:49:07,727 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2023-10-08 08:49:07,729 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2023-10-08 08:49:07,729 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2023-10-08 08:49:07,730 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: no change
2023-10-08 08:49:07,730 - VERBOSE - tf2onnx.optimizer: Apply reshape_optimizer
2023-10-08 08:49:07,731 - VERBOSE - tf2onnx.optimizer.ReshapeOptimizer: no change
2023-10-08 08:49:07,731 - VERBOSE - tf2onnx.optimizer: Apply global_pool_optimizer
2023-10-08 08:49:07,733 - VERBOSE - tf2onnx.optimizer.GlobalPoolOptimizer: no change
2023-10-08 08:49:07,733 - VERBOSE - tf2onnx.optimizer: Apply q_dq_optimizer
2023-10-08 08:49:07,734 - VERBOSE - tf2onnx.optimizer.QDQOptimizer: no change
2023-10-08 08:49:07,734 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2023-10-08 08:49:07,736 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: Identity -5 (5->0)
2023-10-08 08:49:07,736 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2023-10-08 08:49:07,737 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2023-10-08 08:49:07,737 - VERBOSE - tf2onnx.optimizer: Apply einsum_optimizer
2023-10-08 08:49:07,738 - VERBOSE - tf2onnx.optimizer.EinsumOptimizer: no change
2023-10-08 08:49:07,738 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2023-10-08 08:49:07,739 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2023-10-08 08:49:07,739 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2023-10-08 08:49:07,740 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2023-10-08 08:49:07,740 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2023-10-08 08:49:07,741 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2023-10-08 08:49:07,741 - VERBOSE - tf2onnx.optimizer: Apply const_dequantize_optimizer
2023-10-08 08:49:07,742 - VERBOSE - tf2onnx.optimizer.ConstDequantizeOptimizer: no change
2023-10-08 08:49:07,742 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2023-10-08 08:49:07,743 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2023-10-08 08:49:07,743 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2023-10-08 08:49:07,744 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: no change
2023-10-08 08:49:07,744 - VERBOSE - tf2onnx.optimizer: Apply reshape_optimizer
2023-10-08 08:49:07,745 - VERBOSE - tf2onnx.optimizer.ReshapeOptimizer: no change
2023-10-08 08:49:07,745 - VERBOSE - tf2onnx.optimizer: Apply global_pool_optimizer
2023-10-08 08:49:07,746 - VERBOSE - tf2onnx.optimizer.GlobalPoolOptimizer: no change
2023-10-08 08:49:07,746 - VERBOSE - tf2onnx.optimizer: Apply q_dq_optimizer
2023-10-08 08:49:07,747 - VERBOSE - tf2onnx.optimizer.QDQOptimizer: no change
2023-10-08 08:49:07,747 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2023-10-08 08:49:07,748 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2023-10-08 08:49:07,748 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2023-10-08 08:49:07,749 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2023-10-08 08:49:07,749 - VERBOSE - tf2onnx.optimizer: Apply einsum_optimizer
2023-10-08 08:49:07,750 - VERBOSE - tf2onnx.optimizer.EinsumOptimizer: no change
2023-10-08 08:49:07,751 - INFO - tf2onnx.optimizer: After optimization: Concat -1 (1->0), Const -1 (6->5), Identity -5 (5->0), Unsqueeze -3 (3->0)
2023-10-08 08:49:07,752 - INFO - tf2onnx:
2023-10-08 08:49:07,752 - INFO - tf2onnx: Successfully converted TensorFlow model model/test_op_test_onehot.pb to ONNX
2023-10-08 08:49:07,752 - INFO - tf2onnx: Model inputs: ['input:0']
2023-10-08 08:49:07,752 - INFO - tf2onnx: Model outputs: ['output:0']
2023-10-08 08:49:07,752 - INFO - tf2onnx: ONNX model is saved at model/test_op_plugin.onnx
const_input:  Constant (const_fold_opt__18): (shape=(1,), dtype=int32)
values:  [256]
const_input:  Constant (const_fold_opt__19): (shape=(2,), dtype=float32)
values:  [0. 1.]
[08:49:08] /workspace/TPAT/3rdparty/blazerml-tvm/src/tir/transforms/loop_partition.cc:590: Warning: Cannot prove: ((((floordiv(((any_dim*256) + 511), 512) - 1) - floordiv(any_dim, 2)) + 1) >= 0), when generating the post doubt loop
Compile...
/tmp/tuning.log does not exist!




Running...
Compile...
/tmp/tuning.log does not exist!




Running...
Compile...
/tmp/tuning.log does not exist!




Running...
rm -rf ./lib/tpat_test_onehot.so ./obj/*
if [ ! -d ./obj ]; then mkdir -p ./obj; fi
/usr/local/cuda-11.0//bin/nvcc -w -std=c++11 -M -MT tpat_test_onehot.o -I. -I/usr/local/cuda-11.0//samples/common/inc -I/usr/local/cuda-11.0//include -I/usr/include/x86_64-linux-gnu -I/usr/include/x86_64-linux-gnu -I/usr/include -o tpat_test_onehot.d src/tpat_test_onehot.cu
/usr/local/cuda-11.0//bin/nvcc -w -std=c++11 -I. -I/usr/local/cuda-11.0//samples/common/inc -I/usr/local/cuda-11.0//include -I/usr/include/x86_64-linux-gnu -I/usr/include/x86_64-linux-gnu -I/usr/include -Xcompiler -fPIC -arch=sm_75 -o tpat_test_onehot.o -c src/tpat_test_onehot.cu
# /usr/local/cuda-11.0//bin/nvcc -w -std=c++11 -I. -I/usr/local/cuda-11.0//samples/common/inc -I/usr/local/cuda-11.0//include -I/usr/include/x86_64-linux-gnu -I/usr/include/x86_64-linux-gnu -I/usr/include -Xcompiler -fPIC -arch=sm_75 -G -lineinfo -o tpat_test_onehot.o -c src/tpat_test_onehot.cu
g++ -w -std=c++11 -shared -o tpat_test_onehot.so tpat_test_onehot.o -L/usr/local/cuda-11.0//lib64 -L/usr/local/cuda-11.0//lib64 -L/workspace/TensorRT-8.0.3.4/lib  -lnvinfer -lcudart -lcuda -Wl,-rpath=/usr/local/cuda-11.0//lib64 -Wl,-rpath=/usr/local/cuda-11.0//lib64 -Wl,-rpath=/workspace/TensorRT-8.0.3.4/lib
if [ ! -d  ./lib ]; then mkdir -p ./lib; fi
mv *.o   ./obj/
mv *.d   ./obj/
mv *.so ./lib/
Onnx_name_mapping_trt_plugin: {'test_onehot': 'tpat_test_onehot'}
load ./trt_plugin/lib/tpat_test_onehot
[TensorRT] VERBOSE: Registered plugin creator - ::GridAnchor_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::NMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Reorg_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Region_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Clip_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::LReLU_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PriorBox_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Normalize_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::RPROI_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::BatchedNMS_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::FlattenConcat_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::CropAndResize version 1
[TensorRT] VERBOSE: Registered plugin creator - ::DetectionLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Proposal version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ProposalLayer_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::ResizeNearest_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::Split version 1
[TensorRT] VERBOSE: Registered plugin creator - ::SpecialSlice_TRT version 1
[TensorRT] VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
[TensorRT] VERBOSE: ModelImporter.cpp:202: Adding network input: input:0 with dtype: float32, dimensions: (-1, 64)
[TensorRT] VERBOSE: ImporterContext.hpp:116: Registering tensor: input:0 for ONNX tensor: input:0
[TensorRT] VERBOSE: ModelImporter.cpp:90: Importing initializer: dense/kernel/read:0
[TensorRT] VERBOSE: ModelImporter.cpp:90: Importing initializer: clip_by_value/Minimum/y:0
[TensorRT] VERBOSE: ModelImporter.cpp:90: Importing initializer: clip_by_value/y:0
[TensorRT] VERBOSE: ModelImporter.cpp:90: Importing initializer: const_fold_opt__18
[TensorRT] VERBOSE: ModelImporter.cpp:90: Importing initializer: const_fold_opt__19
[TensorRT] VERBOSE: ModelImporter.cpp:103: Parsing node: dense/MatMul [MatMul]
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: input:0
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: dense/kernel/read:0
[TensorRT] VERBOSE: ModelImporter.cpp:125: dense/MatMul [MatMul] inputs: [input:0 -> (-1, 64)], [dense/kernel/read:0 -> (64, 256)],
[TensorRT] VERBOSE: builtin_op_importers.cpp:2053: GEMM: using FC layer instead of MM because all criteria were met.
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] VERBOSE: onnx2trt_utils.cpp:1793: Original shape: (_, 64), unsqueezing to: (_, _, _, _)
[TensorRT] VERBOSE: ImporterContext.hpp:141: Registering layer: dense/MatMul for ONNX node: dense/MatMul
[TensorRT] VERBOSE: onnx2trt_utils.cpp:1641: Original shape: (_, 256, 1, 1), squeezing to: (_, _)
[TensorRT] VERBOSE: ImporterContext.hpp:116: Registering tensor: dense/MatMul:0 for ONNX tensor: dense/MatMul:0
[TensorRT] VERBOSE: ModelImporter.cpp:179: dense/MatMul [MatMul] outputs: [dense/MatMul:0 -> (-1, -1)],
[TensorRT] VERBOSE: ModelImporter.cpp:103: Parsing node: Min__6 [Min]
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: dense/MatMul:0
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: clip_by_value/Minimum/y:0
[TensorRT] VERBOSE: ModelImporter.cpp:125: Min__6 [Min] inputs: [dense/MatMul:0 -> (-1, -1)], [clip_by_value/Minimum/y:0 -> ()],
[TensorRT] VERBOSE: ImporterContext.hpp:141: Registering layer: Min__6 for ONNX node: Min__6
[TensorRT] VERBOSE: ImporterContext.hpp:116: Registering tensor: Min__6:0 for ONNX tensor: Min__6:0
[TensorRT] VERBOSE: ModelImporter.cpp:179: Min__6 [Min] outputs: [Min__6:0 -> (-1, -1)],
[TensorRT] VERBOSE: ModelImporter.cpp:103: Parsing node: Max__9 [Max]
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: Min__6:0
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: clip_by_value/y:0
[TensorRT] VERBOSE: ModelImporter.cpp:125: Max__9 [Max] inputs: [Min__6:0 -> (-1, -1)], [clip_by_value/y:0 -> ()],
[TensorRT] VERBOSE: ImporterContext.hpp:141: Registering layer: Max__9 for ONNX node: Max__9
[TensorRT] VERBOSE: ImporterContext.hpp:116: Registering tensor: Max__9:0 for ONNX tensor: Max__9:0
[TensorRT] VERBOSE: ModelImporter.cpp:179: Max__9 [Max] outputs: [Max__9:0 -> (-1, -1)],
[TensorRT] VERBOSE: ModelImporter.cpp:103: Parsing node: Cast [Cast]
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: Max__9:0
[TensorRT] VERBOSE: ModelImporter.cpp:125: Cast [Cast] inputs: [Max__9:0 -> (-1, -1)],
[TensorRT] VERBOSE: builtin_op_importers.cpp:320: Casting to type: int32
[TensorRT] VERBOSE: ImporterContext.hpp:141: Registering layer: Cast for ONNX node: Cast
[TensorRT] VERBOSE: ImporterContext.hpp:116: Registering tensor: Cast:0 for ONNX tensor: Cast:0
[TensorRT] VERBOSE: ModelImporter.cpp:179: Cast [Cast] outputs: [Cast:0 -> (-1, -1)],
[TensorRT] VERBOSE: ModelImporter.cpp:103: Parsing node: test_onehot [tpat_test_onehot]
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: Cast:0
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: const_fold_opt__18
[TensorRT] VERBOSE: ModelImporter.cpp:119: Searching for input: const_fold_opt__19
[TensorRT] VERBOSE: ModelImporter.cpp:125: test_onehot [tpat_test_onehot] inputs: [Cast:0 -> (-1, -1)], [const_fold_opt__18 -> (1)], [const_fold_opt__19 -> (2)],
[TensorRT] INFO: ModelImporter.cpp:135: No importer registered for op: tpat_test_onehot. Attempting to import as plugin.
[TensorRT] INFO: builtin_op_importers.cpp:3659: Searching for plugin: tpat_test_onehot, plugin_version: 1, plugin_namespace:
[TensorRT] ERROR: INVALID_ARGUMENT: getPluginCreator could not find plugin tpat_test_onehot version 1
In node -1 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[TensorRT] ERROR: Network must have at least one output
[TensorRT] ERROR: Network validation failed.
[ERROR] engine is None
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
root@0390133f0efa:~/examples#
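
The failure above is TensorRT's plugin registry not finding tpat_test_onehot, which typically happens when the generated .so has not been loaded into the process before the ONNX parser runs. A minimal sketch of preloading it, assuming the library path from the README (python/trt_plugin/lib/tpat_onehot.so) and the plugin-replaced model output.onnx:

import ctypes
import tensorrt as trt

# Preload the TPAT-generated library so its plugin creators register
# with TensorRT's plugin registry before the ONNX parser runs.
ctypes.CDLL("python/trt_plugin/lib/tpat_onehot.so")  # path per the README

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("output.onnx", "rb") as f:  # model emitted by onnx2plugin
    if not parser.parse(f.read()):
        print(parser.get_error(0))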

Unsupported PTX version error

I ran the onehot plugin example and hit this error. The GPU info printed by the lspci | grep -i vga command is: VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1). I think this is an RTX 3090. The NVIDIA driver version is 510.

However, on another machine where lspci | grep -i vga prints 37:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1), the onehot example runs successfully (the NVIDIA driver version there is 515). That machine also seems to have an RTX 3090, but the lspci | grep -i vga output differs from the first machine. So far I have not been able to find the cause.
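
An "unsupported PTX version" error usually means the driver's CUDA level is older than the toolkit that compiled the kernel: driver 510 corresponds to CUDA 11.6, while 515 corresponds to 11.7, which could explain why only the second machine works. A quick sketch to check both versions with pycuda:

import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on device 0
import pycuda.driver as cuda

# CUDA level supported by the installed driver, e.g. 11060 for 11.6.
print("driver supports CUDA:", cuda.get_driver_version())
# CUDA toolkit version pycuda itself was built against.
print("pycuda built with CUDA:", cuda.get_version())

If the driver's CUDA level is lower than the toolkit used inside the container, the generated PTX will not load; upgrading the driver (or rebuilding against the older toolkit) should resolve it.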

Docker image

Hello, I am having some problems building the environment. Have you published a pre-built Docker image?

Cannot find project_libbacktrace; error while building TVM from source

I was trying to generate a plugin, but I cannot compile the 3rdparty TVM from source. It cannot download "project_libbacktrace" automatically. Where should I download it from?

CMake Error at /usr/local/share/cmake-3.21/Modules/ExternalProject.cmake:2866 (message):
  No download info given for 'project_libbacktrace' and its source directory:

Are custom operators supported?

Are operators that are not built into TVM supported? If so, which function does the work of generating the computes and schedules? What about a custom operator?

Half model error

Thanks for open-sourcing this work.
When using TPAT to generate a ScatterElements plugin, I got the following error:
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (Concat_52) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=1 Target=3 Dimension=0
My command was:
python onnx_to_plugin.py -i CodeFormer.onnx -o plan.onnx -n ScatterElements_1022 -dynamic=true -min=1 -max=6 -opt=3

The error occurs in python/cuda_kernels.py compute_tensor(), when half_model.onnx is reloaded. The error log is below:

 File "/data//TPAT/python/onnx_to_plugin.py", line 287, in <module>
    onnx2plugin(
  File "/data//TPAT/python/onnx_to_plugin.py", line 190, in onnx2plugin
    onnx_name_mapping_trt_plugin = generate_plugin_library(
  File "/data//TPAT/python/onnx_to_plugin.py", line 85, in generate_plugin_library
    cuda_kernel.run()
  File "/data//TPAT/python/cuda_kernel.py", line 54, in run
    graph_def = self.extract_target_onnx_node(self._onnx_model)
  File "/data//TPAT/python/cuda_kernel.py", line 211, in extract_target_onnx_node
    computed_tensor_shapes = self.compute_tensor_shape(
  File "/data//TPAT/python/cuda_kernel.py", line 163, in compute_tensor_shape
    session = ort.InferenceSession(half_model_path, providers=EP_list)
  File "/home/ningnx/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ningnx/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 408, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (Concat_52) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=1 Target=2 Dimension=0

Links to the ONNX files:
original onnx
half model

I hope someone can help answer this question. Thanks!
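
One thing worth trying, offered only as a guess: the "Can't merge shape info" failure can come from stale shape annotations stored in the half model that conflict with onnxruntime's own shape inference. A sketch that strips them before reloading (half_model.onnx is a hypothetical path; adjust to where TPAT writes it):

import onnx

model = onnx.load("half_model.onnx")  # hypothetical path
# Drop stored intermediate shape annotations so onnxruntime re-infers
# them instead of merging conflicting dimensions.
del model.graph.value_info[:]
onnx.save(model, "half_model_clean.onnx")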

When will the Scan operator be supported?

When will the Scan operator be supported? I need to use this operator! Do you have a plan to support it? What can I do in the meantime?

Conversion Error for IsInf OP

We converted an ONNX model with IsInf ops and it succeeded. We noticed that the IsInf op is implemented as a tpat_isinf plugin plus a Cast op. When we convert the ONNX model to a TensorRT engine, the following error happens:

onnx2trt.py:29: DeprecationWarning: Use set_memory_pool_limit instead.
config.max_workspace_size =( 1 << 20 ) * 3 * 1024
Loading ONNX file from path /home/tensorrt/model_testing-sim.onnx...
Beginning ONNX file parsing
[08/16/2022-10:10:18] [TRT] [W] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
raw shape of 0 is: (6, 3, 928, 1600)
Completed parsing of ONNX file
Building an engine from file /home/tensorrt/model_testing-sim.onnx; this may take a while...
onnx2trt.py:54: DeprecationWarning: Use build_serialized_network instead.
engine = builder.build_engine(network,config)
[08/16/2022-10:11:00] [TRT] [E] 1: [castBuilder.cpp::addSupportedFormats::117] Error Code 1: Internal Error (Cast output type does not support bool.)
Completed creating Engine
Traceback (most recent call last):
File "onnx2trt.py", line 57, in
f.write(engine.serialize())
AttributeError: 'NoneType' object has no attribute 'serialize'

Do you still see this issue with the IsInf op? How can I solve it?
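
The internal error points at a Cast layer whose output type is BOOL, which TensorRT's Cast does not support. A small diagnostic sketch to locate the offending nodes in the ONNX model (the file name is taken from the log above):

import onnx
from onnx import TensorProto

model = onnx.load("model_testing-sim.onnx")  # file from the log above
for node in model.graph.node:
    # "to" holds the Cast target type; TensorProto.BOOL is what TRT rejects.
    to = next((a.i for a in node.attribute if a.name == "to"), None)
    if node.op_type == "Cast" and to == TensorProto.BOOL:
        print("Cast to BOOL:", node.name, "inputs:", list(node.input))

Rewriting those casts to int32 (for example with onnx-graphsurgeon) is one possible workaround, though whether downstream consumers accept int32 depends on the model.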

No subgraph-level optimization for TensorRT

TPAT in fact optimizes one node at a time, i.e. it uses TVM Ansor to auto-tune a single node each time, whereas TVM's optimization is based on subgraphs, as shown below:
[screenshot of TVM subgraph optimization]
I want to know: would aggressive subgraph-level optimization work for TPAT when using TVM subgraphs? Could this function be realized by modifying the TVM code?

Out of memory

Traceback (most recent call last):
  File "test_onehot_dynamic_direct.py", line 344, in <module>
    main()
  File "test_onehot_dynamic_direct.py", line 236, in main
    trt_plugin_names = onnx2plugin(
  File "/root/tpat/examples/../python/onnx_to_plugin.py", line 190, in onnx2plugin
    onnx_name_mapping_trt_plugin = generate_plugin_library(
  File "/root/tpat/examples/../python/onnx_to_plugin.py", line 85, in generate_plugin_library
    cuda_kernel.run()
  File "/root/tpat/python/cuda_kernel.py", line 83, in run
    self._module = graph_executor.create(
  File "/workspace/TPAT/3rdparty/blazerml-tvm/python/tvm/contrib/graph_executor.py", line 66, in create
    return GraphModule(fcreate(graph_json_str, libmod, *device_type_id))
  File "/workspace/TPAT/3rdparty/blazerml-tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  8: TVMFuncCall
  7: _ZNSt17_Function_handlerIFvN3
  6: tvm::runtime::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const [clone .isra.0]
  5: tvm::runtime::GraphExecutorCreate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::Module const&, std::vector<DLDevice, std::allocator<DLDevice> > const&, tvm::runtime::PackedFunc)
  4: tvm::runtime::GraphExecutor::Init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::Module, std::vector<DLDevice, std::allocator<DLDevice> > const&, tvm::runtime::PackedFunc)
  3: tvm::runtime::GraphExecutor::SetupStorage()
  2: tvm::runtime::NDArray::Empty(tvm::runtime::ShapeTuple, DLDataType, DLDevice, tvm::runtime::Optional<tvm::runtime::String>)
  1: tvm::runtime::DeviceAPI::AllocDataSpace(DLDevice, int, long const*, DLDataType, tvm::runtime::Optional<tvm::runtime::String>)
  0: tvm::runtime::CUDADeviceAPI::AllocDataSpace(DLDevice, unsigned long, unsigned long, DLDataType)
  File "/workspace/TPAT/3rdparty/blazerml-tvm/src/runtime/cuda/cuda_device_api.cc", line 123
TVMError:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory

I ran this for a onehot plugin with node input [xxx, 561, 561] and depth 64, and the above error occurred. The node input shape does not seem large enough to use that much memory.
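
A rough estimate suggests the OneHot output, not the input, is what blows up: with dynamic batch enabled TPAT tunes at max_bs (default 256 per the README), and the output is depth (64) times larger than the input. Assuming a float32 output and the unknown batch dim tuned at 256:

# Back-of-the-envelope OneHot output size at max_bs=256 (float32 assumed).
elems_per_sample = 561 * 561 * 64          # ~20.1M elements per sample
bytes_per_sample = elems_per_sample * 4    # ~80.6 MB per sample
print(bytes_per_sample * 256 / 2**30)      # ~19.2 GiB at batch 256

Lowering max_bs (or setting dynamic_bs=False) may keep the tuning run inside GPU memory.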

What is this blazerml-tvm build error?

The build log is below:
[ 89%] Building CXX object CMakeFiles/tvm_objs.dir/src/relay/backend/contrib/example_target_hooks/relay_to_tir.cc.o
[ 89%] Building CXX object CMakeFiles/tvm_objs.dir/src/relay/backend/contrib/example_target_hooks/target.cc.o
[ 89%] Building CXX object CMakeFiles/tvm_objs.dir/src/relay/backend/contrib/example_target_hooks/tir_to_runtime.cc.o
[ 90%] Building CXX object CMakeFiles/tvm_objs.dir/src/contrib/hybrid/codegen_hybrid.cc.o
[ 90%] Built target tvm_objs
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
The command '/bin/sh -c cd /workspace/TPAT/3rdparty/blazerml-tvm/build/ && cmake .. && make -j8' returned a non-zero code: 2

Can anyone tell me what is wrong?

Cannot run the example

Following the instructions in the README, I did the following:

  1. Built the image with the dockerfile
  2. Created a container with one V100-32G card
  3. cd into /workspace/TPAT/examples and ran python test_onehot_dynamic_direct.py; a segfault occurred. Preliminary debugging places the exception at cuda_kernel.run() inside onnx2plugin.

Since I built directly from the dockerfile, I did not modify the value of TRT_LIB_PATH. But I see its default value is /root/workspace/download/ft_local/TensorRT-8.0.0.3/lib, and this directory does not exist in the image. Do I still need to set this value, and if so, how should I set it?

Can't build TPAT

Is there something wrong with the build document? I followed the document to clone the TPAT repository and tried to build with the command:

mkdir build && cp cmake/config.cmake build

the error occurred:

cp: cannot stat 'cmake/config.cmake': No such file or directory

Cuda Error in execute: 209 (no kernel image is available for execution on the device)

Hi,

I'm trying to run TPAT on Jetson AGX with Jetpack 4.4.1

I managed to install everything using the docker image with small modifications to the Dockerfile which now looks like this:

FROM nvcr.io/nvidia/l4t-tensorflow:r32.4.4-tf1.15-py3
RUN apt-get update && apt-get install build-essential cmake -y
RUN wget -O "clang+llvm-9.0.1-aarch64-linux-gnu.tar.xz" https://github.com/llvm/llvm-project/releases/download/llvmorg-9.0.1/clang+llvm-9.0.1-aarch64-linux-gnu.tar.xz \
    && tar -xvf clang+llvm-9.0.1-aarch64-linux-gnu.tar.xz && mkdir -p /usr/local/llvm/ \
    && mv clang+llvm-9.0.1-aarch64-linux-gnu/* /usr/local/llvm/
RUN python3 -m pip install --upgrade pip
RUN pip3 install buildtools onnx==1.10.0 
RUN pip3 install pycuda nvidia-pyindex
RUN apt-get install git
RUN pip install onnx-graphsurgeon onnxruntime==1.9.0 tf2onnx xgboost==1.5.2
RUN git clone --recursive https://github.com/Tencent/TPAT.git /workspace/TPAT && cd /workspace/TPAT/3rdparty/blazerml-tvm && mkdir build && cp cmake/config.cmake build && cd build 
RUN sed -i 's/set(USE_LLVM OFF)/set(USE_LLVM \/usr\/local\/llvm\/bin\/llvm-config)/g' /workspace/TPAT/3rdparty/blazerml-tvm/build/config.cmake 
RUN sed -i 's/set(USE_CUDA OFF)/set(USE_CUDA ON)/g' /workspace/TPAT/3rdparty/blazerml-tvm/build/config.cmake
RUN cd /workspace/TPAT/3rdparty/blazerml-tvm/build/ && cmake .. && make -j8 
ENV TVM_HOME="/workspace/TPAT/3rdparty/blazerml-tvm/"
ENV PYTHONPATH="$TVM_HOME/python:${PYTHONPATH}" 

After running OPENBLAS_CORETYPE=ARMV8 python3 test_tpat.py I get this error:

Onnx_name_mapping_trt_plugin: {'abs_0': 'tpat_abs_0'}
[TensorRT] ERROR: ../rtExt/cuda/cudaPluginV2DynamicExtRunner.cpp (108) - 
Cuda Error in execute: 209 (no kernel image is available for execution on the device)

And it triggers an error on the assert:

[TensorRT] ERROR: FAILED_EXECUTION: std::exception
[[[1.7640524  0.4001572  0.978738   2.2408931  1.867558  ]
  [0.9772779  0.95008844 0.1513572  0.10321885 0.41059852]
  [0.14404356 1.4542735  0.7610377  0.12167501 0.44386324]
  [0.33367434 1.4940791  0.20515826 0.3130677  0.85409576]]

 [[2.5529897  0.6536186  0.8644362  0.742165   2.2697546 ]
  [1.4543657  0.04575852 0.18718386 1.5327792  1.4693588 ]
  [0.15494743 0.37816253 0.88778573 1.9807965  0.34791216]
  [0.15634897 1.2302907  1.2023798  0.3873268  0.30230275]]

 [[1.048553   1.420018   1.7062702  1.9507754  0.5096522 ]
  [0.4380743  1.2527953  0.7774904  1.6138978  0.21274029]
  [0.89546657 0.3869025  0.51080513 1.1806322  0.02818223]
  [0.42833188 0.06651722 0.3024719  0.6343221  0.36274117]]]
================
[array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)]
trt cross_check output  False
Traceback (most recent call last):
  File "test_tpat.py", line 3860, in <module>
    test_abs()
  File "test_tpat.py", line 360, in test_abs
    op_expect(node, inputs=[x], outputs=[y], op_type=op_type, op_name=op_name)
  File "test_tpat.py", line 346, in op_expect
    verify_with_ort_with_trt(model, inputs, op_name, np_result=np_result)
  File "test_tpat.py", line 300, in verify_with_ort_with_trt
    assert ret, "result check False"
AssertionError: result check False

Can you please provide some guidance on what might be the problem?

Thank you!
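
Error 209 ("no kernel image is available") usually means the compiled kernel does not match the GPU's SM architecture; Jetson AGX Xavier reports compute capability 7.2, which desktop-oriented builds typically do not target. A quick check, as a sketch with pycuda:

import pycuda.autoinit  # noqa: F401 -- initializes a CUDA context
import pycuda.driver as cuda

dev = cuda.Device(0)
# AGX Xavier reports (7, 2); the TVM target used when generating the
# plugin must emit code for this SM, or it fails with error 209 at runtime.
print(dev.name(), "compute capability:", dev.compute_capability())

If it prints (7, 2) while the kernel was built for a desktop architecture, regenerating the plugin with TVM pointed at the matching target is the usual remedy.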
