Comments (5)
Hi,
Thank you for your interest in our work.
The SeqCLR projection is applied on the working_layer (see here), which can be either backbone_feature or feature. In the latter case, the shape of the features is indeed (N, T, E) (see here). If you work on backbone_feature, then the shape is (N, E, H, W) (see here). Therefore, in this case we first reshape the features to (N, H*W, E) (see here).
To your specific suggestion: the goal of the projection is to linearly transform the features into a different (usually lower-dimensional) subspace. Therefore, we only want to change the number of channels, E, while preserving the spatial dimensions H and W.
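The reshape-then-project step described above can be sketched as follows. This is an illustrative snippet, not the repo's actual code; the shapes and the `proj_dim` value are made-up assumptions.

```python
import torch
import torch.nn as nn

# Made-up shapes for illustration: batch, channels, height, width.
N, E, H, W = 2, 512, 8, 32
proj_dim = 128  # assumed lower-dimensional projection subspace

backbone_feature = torch.randn(N, E, H, W)

# Reshape (N, E, H, W) -> (N, H*W, E): flatten the spatial dims and
# move channels last, so the linear layer acts per spatial location.
x = backbone_feature.flatten(2).permute(0, 2, 1)  # (N, H*W, E)

# The projection changes only the channel dimension E; the H*W spatial
# extent is preserved.
projection = nn.Linear(E, proj_dim)
z = projection(x)

print(tuple(z.shape))  # (2, 256, 128)
```

Because the linear layer is applied independently at each of the H*W positions, the spatial structure survives the projection, which is exactly the property described above.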
Let me know if you have any follow-up questions,
Aviad
from semimtr-text-recognition.
Now I understand the goal of the projection. I'd like to know how you handle the output of the vision model in the fine-tuning stage (training with labeled data). Is the tensor of shape (N, E, H, W) reshaped to (N, H*W, E) or to (N, W, E*H) before being fed into the CTC or attention decoder? I think the latter, (N, W, E*H), is more common in text recognition tasks, but it is inconsistent with the (N, H*W, E) layout used in pre-training.
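The two layouts in question can be sketched side by side. This is an illustrative comparison with made-up shapes, not code from the repo.

```python
import torch

# Made-up shapes for illustration: batch, channels, height, width.
N, E, H, W = 2, 512, 8, 32
feat = torch.randn(N, E, H, W)

# Layout 1: (N, H*W, E) -- flatten both spatial dims into one sequence
# axis, as in the pre-training projection.
layout1 = feat.flatten(2).permute(0, 2, 1)   # (N, H*W, E)

# Layout 2: (N, W, E*H) -- keep width as the sequence axis and fold
# height into the channels, as is common before CTC/1D decoders.
layout2 = feat.permute(0, 3, 1, 2).flatten(2)  # (N, W, E*H)

print(tuple(layout1.shape), tuple(layout2.shape))  # (2, 256, 512) (2, 32, 4096)
```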
Looking forward to your response. Thanks!
Hi,
In the vision model, there is a transformer unit which is applied after the backbone. This 2D attention layer operates directly on the feature map, of size (N, T, H, W), and outputs a tensor of shape (N, T, E). To answer your question explicitly: we use a 2D attention-based decoder, so the reshape you mention is not needed for supervised fine-tuning.
I hope that it's clearer now,
Aviad
OK, I will study the code further. Thanks again!
You're welcome :)
I'm closing the issue. If you have additional questions, you can re-open it.
Aviad