Hi, I'm a little confused about the forward function in SeqCLR. In the …

The shape of projector's input about semimtr-text-recognition HOT 5 CLOSED

amazon-science commented on May 29, 2024

The shape of projector's input

from semimtr-text-recognition.

Comments (5)

aaberdam commented on May 29, 2024

Hi,
Thank you for the interest in our work.
The SeqCLR projection is applied on the working_layer (see here), which can be backbone_feature or feature. In the latter case, the shape of the features is indeed (N, T, E) (see here). If you work on the backbone_feature then the shape is (N, E, H, W) (see here). Therefore, in this case we first reshape them to become (N, H*W, E) (see here).
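(Editor's note) The reshape described above can be sketched as follows. This is a minimal illustration, not the repository's code: it uses NumPy arrays as a stand-in for PyTorch tensors, and the sizes N, E, H, W are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N, E, H, W = 2, 8, 4, 16

# Stand-in for backbone_feature, with channels-first layout (N, E, H, W).
backbone_feature = np.random.rand(N, E, H, W)

# Move the channel axis last, (N, H, W, E), then merge the two spatial
# axes so each of the H*W positions keeps its own E-dimensional vector.
flat = backbone_feature.transpose(0, 2, 3, 1).reshape(N, H * W, E)

print(flat.shape)  # (2, 64, 8)
```

Position (h, w) of the feature map ends up at index h * W + w along the second axis, so no channel values are mixed between spatial positions.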

Regarding your specific suggestion: the goal of the projection is to linearly transform the features into a different (usually lower-dimensional) subspace. Therefore, we only want to change the number of channels, i.e., E, and preserve the spatial dimensions H and W.
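(Editor's note) To make the point about the projection concrete, here is a small sketch of a linear map applied only along the channel axis. It is an assumption-laden illustration, not the SeqCLR projection head itself: the weight matrix, the bias, and the output size P are hypothetical, and NumPy stands in for PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: HW = H*W spatial positions, E input channels,
# P output channels of the projection (P < E here, but that is a choice).
N, HW, E, P = 2, 64, 8, 4

features = rng.standard_normal((N, HW, E))  # already reshaped to (N, H*W, E)
weight = rng.standard_normal((E, P))        # hypothetical projection weights
bias = np.zeros(P)                          # hypothetical bias

# The matrix product acts on the last axis only, so every spatial
# position is projected independently with the same weights.
projected = features @ weight + bias

print(projected.shape)  # (2, 64, 4)
```

Only the last dimension changes, from E to P; the number of spatial positions H*W is untouched.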

Let me know if you have follow-up questions,
Aviad

YusenZhang826 commented on May 29, 2024

Now I understand the goal of the projection. Next, I want to know how you handle the output of the vision model in the fine-tuning stage (training with labeled data). Is the tensor of shape (N, E, H, W) reshaped to (N, H*W, E) or to (N, W, E*H) before being fed into the CTC or attention decoder? I think the latter, (N, W, E*H), is more common in text recognition tasks, but it is inconsistent with the (N, H*W, E) layout used in pre-training.
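(Editor's note) The two layouts being compared can be sketched side by side. This is only an illustration with hypothetical sizes, using NumPy in place of PyTorch; it is not taken from the repository.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N, E, H, W = 2, 8, 4, 16
x = np.arange(N * E * H * W).reshape(N, E, H, W)

# Layout (N, H*W, E): one E-dimensional vector per spatial position.
per_position = x.transpose(0, 2, 3, 1).reshape(N, H * W, E)

# Layout (N, W, E*H): one vector per image column, obtained by stacking
# the channels and the rows of that column into a single axis.
per_column = x.transpose(0, 3, 1, 2).reshape(N, W, E * H)

print(per_position.shape, per_column.shape)  # (2, 64, 8) (2, 16, 32)
```

The first layout yields a sequence of length H*W with E features per step, while the second yields a sequence of length W with E*H features per step, so the two are not interchangeable without changing the layers that consume them.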
Looking forward to your response. Thanks!

aaberdam commented on May 29, 2024

Hi,

In the vision model, there is a transformer unit which is applied after the backbone:

semimtr-text-recognition/semimtr/modules/model_vision.py

Line 44 in d4921a3

attn_vecs, attn_scores = self.attention(features) # (N, T, E), (N, T, H, W)

This 2D attention layer operates directly on the feature map of shape (N, E, H, W). It computes attention scores of shape (N, T, H, W) and outputs a tensor of shape (N, T, E).
To answer your question explicitly: we use a 2D attention-based decoder, and therefore we don't need the reshape that you mentioned for the supervised fine-tuning.
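(Editor's note) The shapes involved in such a 2D attention step can be illustrated with the sketch below. It is not the repository's attention module: how the scores are actually produced is not shown in this thread, so random logits are used as a placeholder, and the sizes are hypothetical. It assumes a backbone feature map of shape (N, E, H, W), as described earlier in the thread, and T output positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T is the number of output sequence positions.
N, E, H, W, T = 2, 8, 4, 16, 5

features = rng.standard_normal((N, E, H, W))  # backbone feature map
logits = rng.standard_normal((N, T, H, W))    # placeholder for learned scores

# Softmax over all H*W spatial positions, separately for each of the
# T output positions (max subtracted for numerical stability).
flat_logits = logits.reshape(N, T, H * W)
flat_logits = flat_logits - flat_logits.max(axis=-1, keepdims=True)
weights = np.exp(flat_logits)
weights = weights / weights.sum(axis=-1, keepdims=True)
attn_scores = weights.reshape(N, T, H, W)     # (N, T, H, W)

# Each output vector is a weighted sum of the E-dimensional feature
# vectors over the whole 2D map, so no manual flattening to a 1D
# sequence is needed before decoding.
attn_vecs = np.einsum("nthw,nehw->nte", attn_scores, features)  # (N, T, E)

print(attn_vecs.shape, attn_scores.shape)  # (2, 5, 8) (2, 5, 4, 16)
```

The sequence length T comes from the attention step itself rather than from H or W, which is why neither the (N, H*W, E) nor the (N, W, E*H) reshape is required in this setting.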

I hope it's clearer now,
Aviad

YusenZhang826 commented on May 29, 2024

OK, I will study the code further. Thanks again!

aaberdam commented on May 29, 2024

You're welcome :)
I'm closing the issue. If you have additional questions, you can re-open it.
Aviad
