Comments (4)
Hi @HenryHZY,
- You can test on 4 GPUs instead of 8, or double --batch_size when using 8 GPUs. Then we can discuss the results. I am not sure what is affecting the performance at the moment.
- The three log lines are redundant and do not affect pretraining, training, or inference. Just ignore them, or treat them as noise.
Thanks.
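The suggestion to double --batch_size on 8 GPUs can be read as keeping the per-GPU batch constant. The sketch below shows that arithmetic under the assumption that --batch_size is a global value split evenly across GPUs; this thread does not confirm how the training script divides it, and the helper name and the numbers are hypothetical.

```python
def per_gpu_batch_size(global_batch_size: int, n_gpu: int) -> int:
    """Samples each GPU sees per step if the global batch is split evenly.

    Assumption: --batch_size is global and divided across GPUs.
    """
    if global_batch_size % n_gpu != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return global_batch_size // n_gpu


# Hypothetical numbers, chosen only for illustration.
base = per_gpu_batch_size(128, 4)      # per-GPU batch on 4 GPUs
halved = per_gpu_batch_size(128, 8)    # same flag value on 8 GPUs
restored = per_gpu_batch_size(256, 8)  # after doubling --batch_size
```

Under that assumption, moving from 4 to 8 GPUs with an unchanged flag halves the per-GPU batch, and doubling the flag restores it.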
from univl.
@ArrowLuo Thanks for your quick reply!
Actually, I have also tested with 4 A100 GPUs. I will run the doubled batch_size experiment on 8 A100 GPUs later.
retrieval, FT-Align, 4 A100 GPUs
R@1: 0.2510 - R@5: 0.5780 - R@10: 0.7010 - Median R: 4.0
Maybe I need to change some parameters, such as epochs, batch_size, and lr, to obtain a better result?
Do you have any other fine-tuning experience to share?
For example, as in your answer to #18, increasing batch_size as far as my GPUs allow.
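For readers unfamiliar with the reported numbers, here is a minimal sketch of how R@K and Median R are commonly computed from a text-to-video similarity matrix. It is not the repository's evaluation code: the function name, the 1-based rank convention, and the toy matrix are assumptions for illustration.

```python
import numpy as np


def retrieval_metrics(sim: np.ndarray) -> dict:
    """sim[i, j] is the similarity of text query i and video j.

    Assumption: the correct video for query i is video i.
    """
    correct = np.diag(sim)[:, None]
    # Rank of the correct video: videos scored strictly higher, plus one.
    ranks = (sim > correct).sum(axis=1) + 1
    return {
        "R@1": float(np.mean(ranks <= 1)),
        "R@5": float(np.mean(ranks <= 5)),
        "R@10": float(np.mean(ranks <= 10)),
        "Median R": float(np.median(ranks)),
    }


# Toy 3x3 similarity matrix (made-up values).
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
])
metrics = retrieval_metrics(sim)
```

In the toy matrix only the first query ranks its own video first, so R@1 is one third while R@5 and R@10 are 1.0.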
Hi @HenryHZY, yes, the epochs, batch_size, and lr are important for the retrieval tasks. I cannot recall other fine-tuning details or tricks now, since it has been a long time since I worked on this.
Hi @ArrowLuo, I would like to ask whether the input to UniVL is video-sentences, clip-sentence, or clip-sentences.
Following your instructions, I obtained the video features and text features.
Given a video_id_x covering the time interval [0, m-1] seconds, after feature extraction video_id_x.npy is an np.array of shape [m, 1024].
Suppose video_id_x has n video clips with n corresponding sentences (defined in caption.pickle):
"video_id_x":{
"start":[s_1, s_2, ..., s_n],
"end":[e_1, e_2, ..., e_n],
"text":["t_1", "t_2", ..., "t_n"]
}
Then, what is the shape of the original input tokens to UniVL? A single video clip and its one sentence?
Take the time interval [s_1, e_1] of the first video clip as an example:
video tokens: [e_1-s_1+1, 1024]
text tokens: [tokens_sum_of_t_1, word_token_embedding_size]
Are all the above data formats correct, including [m, 1024], [e_1-s_1+1, 1024] and [tokens_sum_of_t_1, word_token_embedding_size]?
Thanks for your time!
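The layout described in the question can be sketched as below, assuming one feature row per second and an inclusive [start, end] interval. Whether UniVL slices this way, or pads or truncates clips, is not confirmed in this thread; the helper name and the example values are hypothetical.

```python
import numpy as np


def slice_clip_features(video_feats: np.ndarray, start: int, end: int) -> np.ndarray:
    """Rows for seconds start..end (inclusive) of a [m, 1024] feature array."""
    if not (0 <= start <= end < video_feats.shape[0]):
        raise ValueError("clip interval falls outside the video")
    return video_feats[start:end + 1]


# Hypothetical video of m = 10 seconds with two captioned clips.
m = 10
video_feats = np.zeros((m, 1024), dtype=np.float32)
caption = {"start": [0, 4], "end": [3, 9], "text": ["t_1", "t_2"]}

clips = [
    slice_clip_features(video_feats, s, e)
    for s, e in zip(caption["start"], caption["end"])
]
shapes = [clip.shape for clip in clips]
```

Each clip then has shape [e_i - s_i + 1, 1024], matching the format proposed in the question.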