Comments (3)
Steps to reproduce:

```
./pplnn-build/tools/pplnn --in-shapes 32_3_224_224 --dims 32_3_224_224 --warmuptimes 200 --runningtimes 200 --onnx-model vgg16.onnx
[INFO][2021-07-05 08:31:30.885][pplnn.cc:683] ppl.nn version: v0.1.0-dirty
[INFO][2021-07-05 08:31:32.207][pplnn.cc:88] ***** register CudaEngine *****
[INFO][2021-07-05 08:31:32.940][simple_graph_partitioner.cc:90] total partition(s) of graph[torch-jit-export]: 1.
[INFO][2021-07-05 08:31:33.295][opt_graph.cc:187] Create 71 TensorImpl
[INFO][2021-07-05 08:31:33.295][opt_graph.cc:299] added 56 new bridge kernels
[INFO][2021-07-05 09:46:30.989][opt_graph.cc:461] deleted 52 bridge kernels
[INFO][2021-07-05 09:46:46.325][pplnn.cc:523] ----- input info -----
[INFO][2021-07-05 09:46:46.326][pplnn.cc:526] input[0]:
[INFO][2021-07-05 09:46:46.326][pplnn.cc:527] name: input.1
[INFO][2021-07-05 09:46:46.326][pplnn.cc:534] dim(s): 32 3 224 224
[INFO][2021-07-05 09:46:46.326][pplnn.cc:536] DataType: FLOAT32
[INFO][2021-07-05 09:46:46.326][pplnn.cc:537] DataFormat: NDARRAY
[INFO][2021-07-05 09:46:46.326][pplnn.cc:538] NumBytesIncludePadding: 19267584
[INFO][2021-07-05 09:46:46.326][pplnn.cc:539] NumBytesExcludePadding: 19267584
[INFO][2021-07-05 09:46:46.326][pplnn.cc:542] ----- output info -----
[INFO][2021-07-05 09:46:46.326][pplnn.cc:545] output[0]:
[INFO][2021-07-05 09:46:46.326][pplnn.cc:546] name: 70
[INFO][2021-07-05 09:46:46.326][pplnn.cc:553] dim(s): 32 1000
[INFO][2021-07-05 09:46:46.326][pplnn.cc:555] DataType: FLOAT32
[INFO][2021-07-05 09:46:46.326][pplnn.cc:556] DataFormat: NDARRAY
[INFO][2021-07-05 09:46:46.326][pplnn.cc:557] NumBytesIncludePadding: 128000
[INFO][2021-07-05 09:46:46.326][pplnn.cc:558] NumBytesExcludePadding: 128000
[INFO][2021-07-05 09:46:46.326][pplnn.cc:561] ----------------------
[INFO][2021-07-05 09:46:46.326][pplnn.cc:791] Run() costs: 9175.929688 ms.
[INFO][2021-07-05 09:46:46.326][pplnn.cc:799] Run ok
```
As shown in the log, the run starts at 08:31 but inference does not begin until 09:46, so preparation took about 75 minutes. Is this normal? The model was imported from torchvision and exported to ONNX:
```python
import torch
import torchvision

# Export torchvision's pretrained VGG16 to ONNX with a batch-32 dummy input.
dummy_input = torch.randn(32, 3, 224, 224)
model = torchvision.models.vgg16(pretrained=True)
model.eval()
torch.onnx.export(model, dummy_input, "vgg16.onnx", opset_version=11)
```

Also, when testing with batch size = 1, the time is normal:
```
# ./pplnn-build/tools/pplnn --onnx-model vgg16.onnx --in-shapes 1_3_224_224 --dims 1_3_224_224 --warmuptimes 100 --runningtimes 100
[INFO][2021-07-05 05:21:44.428][pplnn.cc:683] ppl.nn version: v0.1.0-dirty
[INFO][2021-07-05 05:21:46.437][pplnn.cc:88] ***** register CudaEngine *****
[INFO][2021-07-05 05:21:47.230][simple_graph_partitioner.cc:90] total partition(s) of graph[torch-jit-export]: 1.
[INFO][2021-07-05 05:21:47.511][opt_graph.cc:187] Create 71 TensorImpl
[INFO][2021-07-05 05:21:47.511][opt_graph.cc:299] added 56 new bridge kernels
[INFO][2021-07-05 05:24:30.634][opt_graph.cc:461] deleted 52 bridge kernels
[INFO][2021-07-05 05:24:31.300][pplnn.cc:523] ----- input info -----
[INFO][2021-07-05 05:24:31.300][pplnn.cc:526] input[0]:
[INFO][2021-07-05 05:24:31.300][pplnn.cc:527] name: input.1
[INFO][2021-07-05 05:24:31.300][pplnn.cc:534] dim(s): 1 3 224 224
[INFO][2021-07-05 05:24:31.300][pplnn.cc:536] DataType: FLOAT32
[INFO][2021-07-05 05:24:31.300][pplnn.cc:537] DataFormat: NDARRAY
[INFO][2021-07-05 05:24:31.300][pplnn.cc:538] NumBytesIncludePadding: 602112
[INFO][2021-07-05 05:24:31.300][pplnn.cc:539] NumBytesExcludePadding: 602112
[INFO][2021-07-05 05:24:31.300][pplnn.cc:542] ----- output info -----
[INFO][2021-07-05 05:24:31.300][pplnn.cc:545] output[0]:
[INFO][2021-07-05 05:24:31.300][pplnn.cc:546] name: 70
[INFO][2021-07-05 05:24:31.300][pplnn.cc:553] dim(s): 1 1000
[INFO][2021-07-05 05:24:31.300][pplnn.cc:555] DataType: FLOAT32
[INFO][2021-07-05 05:24:31.300][pplnn.cc:556] DataFormat: NDARRAY
[INFO][2021-07-05 05:24:31.300][pplnn.cc:557] NumBytesIncludePadding: 4000
[INFO][2021-07-05 05:24:31.300][pplnn.cc:558] NumBytesExcludePadding: 4000
[INFO][2021-07-05 05:24:31.300][pplnn.cc:561] ----------------------
[INFO][2021-07-05 05:24:31.300][pplnn.cc:791] Run() costs: 344.269989 ms.
[INFO][2021-07-05 05:24:31.300][pplnn.cc:799] Run ok
```
Actually, it may take hours to select the fastest algorithm for the conv and gemm ops in the prepare stage, especially when the batch size is large.
from ppl.nn.
The time cost for batch = 32 is reasonable. The algorithm selection process runs candidates with the real tensor size and picks the one with the shortest execution time out of over 6000 kernels. Thus, the preparation time for the 32-batch model will be approximately 32 times longer than for a single batch.
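A back-of-the-envelope check of that linear-scaling claim, using the batch-1 preparation time visible in the logs (roughly 2.7 minutes between "added 56 new bridge kernels" at 05:21:47 and "deleted 52 bridge kernels" at 05:24:30):

```python
# Rough first-order model of the prepare stage: algorithm selection
# benchmarks candidate kernels at the real tensor size, so its cost
# grows about linearly with the batch dimension.
def estimated_prepare_minutes(batch_size, minutes_per_unit_batch=2.7):
    # 2.7 min is read off the batch-1 log (05:21:47 -> 05:24:30).
    return batch_size * minutes_per_unit_batch

# Batch 32 extrapolates to ~86 min, the same order of magnitude as the
# ~75 min actually observed between 08:31 and 09:46 in the batch-32 log.
print(round(estimated_prepare_minutes(32)))
```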
If over an hour of preparation for batch 32 is unacceptable, there are two ways to reduce the time cost of the prepare stage:
1. Use '--quick-select' to skip algorithm selection.
2. Reduce the dim size with '--dims', e.g. '--dims 3_3_224_224'.
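The two options above, sketched as concrete invocations. This is only an illustration built from the flags already shown in this thread (`--quick-select`, `--dims`, `--in-shapes`, `--warmuptimes`, `--runningtimes`); exact behavior may vary by ppl.nn version.

```shell
# Option 1: skip exhaustive algorithm selection entirely
# (fast startup, possibly slower kernels at run time).
./pplnn-build/tools/pplnn --onnx-model vgg16.onnx \
    --in-shapes 32_3_224_224 --dims 32_3_224_224 \
    --quick-select --warmuptimes 200 --runningtimes 200

# Option 2: keep selection, but benchmark candidates with a smaller
# leading dim so each of the ~6000 kernel trials finishes faster.
./pplnn-build/tools/pplnn --onnx-model vgg16.onnx \
    --in-shapes 32_3_224_224 --dims 3_3_224_224 \
    --warmuptimes 200 --runningtimes 200
```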
Thanks for the explanation.
Related Issues (20)
- Slice op question HOT 1
- pplnn run mobilenet v2 model failed. (use cuda) HOT 7
- linux compile error protobuf static assertion failed HOT 3
- malloc_consolidate(): invalid chunk size HOT 2
- The NDARRAY shape obtained from pplnn save-input is incorrect HOT 1
- How to build ppl.nn together with code that depends on ppl.nn using cmake? HOT 3
- Segmentation fault at ppl::nn::x86::X86Kernel::DumpOutputTensors HOT 5
- Fetching model inference results (GetOutputs) takes a long time HOT 2
- Install Error HOT 1
- The compilation passed, but an error was reported in test phase HOT 2
- Floating point exception (core dumped) ? HOT 4
- Core dump when running a resnet50 fp16 onnx model with the x86 engine
- (Ask) why InferInheritedType handle int8 to fp16 out? HOT 3
- Got wrong output shape when run a Gemm op(transB=0) use cuda HOT 4
- Crash with ONNX Split operator
- Performance degradation when a global engine is referenced from other threads HOT 4
- Troubleshooting inference accuracy errors
- Example of a multi-model pipeline
- Can int8 inference run on the ARM platform?
- cuda build error HOT 1