
Comments (5)

aaberdam commented on May 29, 2024

Hi,
Thank you for the interest in our work.
The SeqCLR projection is applied on the working_layer, which can be either backbone_feature or feature. In the latter case, the shape of the features is indeed (N, T, E). If you work on backbone_feature, the shape is (N, E, H, W), so in this case we first reshape the features to (N, H*W, E).

Regarding your specific suggestion: the goal of the projection is to linearly transform the features into a different (usually lower-dimensional) subspace. Therefore, we only want to change the number of channels, i.e., E, and to preserve the spatial dimensions H and W.
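As a minimal sketch of the reshape and projection described above (not the repository's actual code), the following assumes PyTorch, a plain nn.Linear as the projection, and made-up sizes; the names proj_dim, frames, and projection are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
N, E, H, W = 2, 512, 8, 32
proj_dim = 128  # assumed lower-dimensional projection size

backbone_feature = torch.randn(N, E, H, W)

# (N, E, H, W) -> (N, E, H*W) -> (N, H*W, E): one E-dimensional frame per spatial position.
frames = backbone_feature.flatten(2, 3).permute(0, 2, 1)

# The linear layer acts on the last dimension only, so E changes
# while the H*W spatial positions are preserved.
projection = nn.Linear(E, proj_dim)
projected = projection(frames)  # (N, H*W, proj_dim)
```

With this layout, position h * W + w in the flattened sequence corresponds to the spatial location (h, w) of the feature map.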

Let me know if you have follow-up questions,
Aviad

from semimtr-text-recognition.

YusenZhang826 commented on May 29, 2024

Now I understand the goal of the projection. Next, I would like to know how you handle the output of the vision model in the fine-tuning stage (training with labeled data). Is the tensor of shape (N, E, H, W) reshaped to (N, H*W, E) or to (N, W, E*H) before being fed into the CTC or attention decoder? I think the latter, (N, W, E*H), is more common in text recognition tasks, but it is inconsistent with the (N, H*W, E) layout used in pre-training.
Looking forward to your response. Thanks!
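For readers comparing the two layouts mentioned in this question, here is a small PyTorch sketch with made-up sizes; it only illustrates the two reshapes and is not taken from the repository.

```python
import torch

# Small hypothetical sizes so the index mapping is easy to check.
N, E, H, W = 2, 4, 3, 5
x = torch.randn(N, E, H, W)

# Layout A: one frame per spatial position, channels kept as the feature dimension.
per_position = x.flatten(2).transpose(1, 2)  # (N, H*W, E)

# Layout B: one frame per image column, with the height folded into the channels.
per_column = x.permute(0, 3, 1, 2).reshape(N, W, E * H)  # (N, W, E*H)
```

Layout A yields H*W frames of size E, while layout B yields W frames of size E*H, so a layer trained on one layout cannot be applied directly to the other.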

aaberdam commented on May 29, 2024

Hi,

In the vision model, there is a transformer unit which is applied after the backbone:

attn_vecs, attn_scores = self.attention(features) # (N, T, E), (N, T, H, W)

This 2D attention layer operates directly on the feature map of shape (N, E, H, W). It outputs a tensor of shape (N, T, E), together with attention scores of shape (N, T, H, W).
To answer your question explicitly: we use a 2D attention-based decoder, and therefore we don't need the reshape that you mentioned for the supervised fine-tuning.
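To make the shapes concrete, here is a heavily simplified sketch of a 2D attention layer in PyTorch. It is an assumption-laden toy, not the repository's attention module: the class name ToyPositionAttention, the learned per-position queries, and the scaled dot-product scoring are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn


class ToyPositionAttention(nn.Module):
    """Toy 2D attention: T learned queries attend over all H*W positions."""

    def __init__(self, embed_dim: int, max_length: int):
        super().__init__()
        # One learned query per output character position (hypothetical design).
        self.queries = nn.Parameter(torch.randn(max_length, embed_dim))
        self.scale = embed_dim ** -0.5

    def forward(self, features: torch.Tensor):
        n, e, h, w = features.shape
        keys = features.flatten(2)        # (N, E, H*W)
        values = keys.transpose(1, 2)     # (N, H*W, E)
        # (T, E) @ (N, E, H*W) broadcasts to (N, T, H*W); softmax over positions.
        scores = torch.softmax((self.queries @ keys) * self.scale, dim=-1)
        attn_vecs = scores @ values       # (N, T, E)
        return attn_vecs, scores.reshape(n, -1, h, w)  # scores: (N, T, H, W)


attention = ToyPositionAttention(embed_dim=16, max_length=7)
attn_vecs, attn_scores = attention(torch.randn(2, 16, 4, 8))
```

Because the attention weights are computed over the full H x W grid, the layer consumes the 4D feature map as is and produces a length-T sequence without any prior flattening to (N, W, E*H).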

I hope it's clearer now,
Aviad

YusenZhang826 commented on May 29, 2024

OK, I will study the code further. Thanks again!

aaberdam commented on May 29, 2024

You're welcome :)
I'm closing the issue. If you have additional questions, you can re-open it.
Aviad
