Try feed into a frame with shape: torch.Size([1, 3, 224, 224

what's the input shape of model? about yowo HOT 8 CLOSED

lucasjinreal commented on July 19, 2024

What's the input shape of the model?

from yowo.

Comments (8)

okankop commented on July 19, 2024

The first dimension of the tensor is always reserved for the batch size, so if you want to load only one clip, you need to add a leading dimension of size 1. Moreover, you are concatenating the frames along the wrong dimension. Your final tensor shape should be [1,3,16,h,w] so that the line 'x_2d = input[:, :, -1, :, :]' in "model.py" can successfully take the last frame of the clip.

For successful inference you also need to preprocess the clip (normalization etc.) in the same way as in the test phase.
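The shape described above can be sketched with NumPy as follows. This is only an illustration with dummy data, not code from the repository: `frames_to_clip` and `CLIP_LEN` are hypothetical names, the 224x224 size is an arbitrary choice, and the normalization step is left out because its exact values depend on the test-phase pipeline.

```python
import numpy as np

CLIP_LEN = 16  # the thread also mentions an 8-frame variant


def frames_to_clip(frames):
    """Stack CLIP_LEN frames of shape [3, h, w] into a [1, 3, CLIP_LEN, h, w] array."""
    if len(frames) != CLIP_LEN:
        raise ValueError(f"expected {CLIP_LEN} frames, got {len(frames)}")
    clip = np.stack(frames, axis=1)   # [3, CLIP_LEN, h, w]: time goes on axis 1, after channels
    return clip[np.newaxis, ...]      # [1, 3, CLIP_LEN, h, w]: leading batch dimension


# Dummy frames: frame i is filled with the value i so the order is easy to check.
frames = [np.full((3, 224, 224), i, dtype=np.float32) for i in range(CLIP_LEN)]
clip = frames_to_clip(frames)
print(clip.shape)                     # (1, 3, 16, 224, 224)

# The same slice that model.py applies to pick the last frame of the clip.
last_frame = clip[:, :, -1, :, :]
print(last_frame.shape)               # (1, 3, 224, 224)
```

The resulting array can then be converted with `torch.from_numpy(clip)` before being passed to the model.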

wei-tim commented on July 19, 2024

@jinfagang
Thanks for your interest. Our 3D-CNN model extracts spatial-temporal information from an input clip consisting of several successive frames, thus you need to concatenate them (8/16 frames) together as a clip.

lucasjinreal commented on July 19, 2024

How do I specify whether to use 2D or 3D? It seems that both are used by default. Does 8/16 mean 8 to 16 frames?

wei-tim commented on July 19, 2024

@jinfagang
The 3D model helps to understand an action, while the 2D model boosts the localization precision. Our algorithm fuses both 3D and 2D information to achieve the spatial-temporal localization task. If only a single model is employed, the result will be worse. You can find the corresponding ablation study in our paper.

We provide two options: 8 frames or 16 frames. The model with 8 frames performs slightly worse than the one with 16 frames but is more efficient. The experimental results are also presented in the paper.

lucasjinreal commented on July 19, 2024

Thanks, I got it. Does that mean the input video needs at least 16 frames for inference?

wei-tim commented on July 19, 2024

@jinfagang
For the model with 16 frames, yes.
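To illustrate the 16-frame minimum, here is a rough sketch of how a longer video could be cut into full 16-frame clips. `sliding_clips` is a hypothetical helper written for this example, not something taken from the repository, and the stride of 1 is an assumption.

```python
def sliding_clips(num_frames, clip_len=16, stride=1):
    """Yield (start, end) frame index ranges, end exclusive, for each full clip."""
    for start in range(0, num_frames - clip_len + 1, stride):
        yield start, start + clip_len


print(list(sliding_clips(18)))  # [(0, 16), (1, 17), (2, 18)]
print(list(sliding_clips(15)))  # [] because the video is shorter than one clip
```

A video with fewer than `clip_len` frames yields no clip at all, which matches the answer above.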

kinivi commented on July 19, 2024

@wei-tim can I manually change the clip size to 32?

GxZhu commented on July 19, 2024

@jinfagang Running into the same error as you. I read in an image frame as an np.array and made the shape of the image [3,h,w]. Then I concatenated 16 consecutive frames into an array with shape [16, 3, h, w] before converting to Tensor.

I am still missing a dimension (the shape currently has 4 dimensions, not 5). Did you find a fix?
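Going by the [1,3,16,h,w] layout given earlier in the thread, an array stacked as [16, 3, h, w] would need its first two axes swapped and a batch dimension added. A minimal NumPy sketch with dummy data, where the 224x224 size and variable names are assumptions:

```python
import numpy as np

# Stand-in for 16 frames stacked along the first axis: [16, 3, h, w].
# Frame i is filled with the value i so the order can be checked afterwards.
h, w = 224, 224
frame_ids = np.arange(16, dtype=np.float32).reshape(16, 1, 1, 1)
stacked = frame_ids * np.ones((1, 3, h, w), dtype=np.float32)
print(stacked.shape)                   # (16, 3, 224, 224)

clip = stacked.transpose(1, 0, 2, 3)   # swap frames and channels: [3, 16, h, w]
clip = np.expand_dims(clip, axis=0)    # add the batch dimension: [1, 3, 16, h, w]
clip = np.ascontiguousarray(clip)      # contiguous copy after the transpose
print(clip.shape)                      # (1, 3, 16, 224, 224)
```

After this, `torch.from_numpy(clip)` gives a 5-dimensional tensor; the preprocessing mentioned earlier in the thread still has to be applied separately.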
