Questions on retrieval result and "Info: Weight doesn't exsits" about univl HOT 4 CLOSED

microsoft commented on August 16, 2024

Questions on retrieval result and "Info: Weight doesn't exsits"

from univl.

Comments (4)

ArrowLuo commented on August 16, 2024

Hi @HenryHZY,

You can test on 4 GPUs instead of 8, or double the --batch_size when using 8 GPUs. Then we can discuss the results. I am not sure what is affecting the performance right now.
Those three log lines are redundant and do not affect pretraining, training, or inference. Just ignore them, or treat them as noise.
Thanks.
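
A rough sketch of the arithmetic behind the batch-size suggestion, assuming `--batch_size` is the total batch divided evenly across GPUs (an assumption, not confirmed in this thread). The helper `per_gpu_batch` and the batch sizes 128 and 256 are hypothetical values chosen for illustration:

```python
def per_gpu_batch(global_batch_size, n_gpu):
    """Samples each GPU sees per step if the global batch is split evenly (assumed behavior)."""
    if n_gpu <= 0 or global_batch_size % n_gpu != 0:
        raise ValueError("global batch size must be a positive multiple of n_gpu")
    return global_batch_size // n_gpu


# Hypothetical numbers, for illustration only.
on_4_gpus = per_gpu_batch(128, 4)          # baseline setup
on_8_gpus_same = per_gpu_batch(128, 8)     # same flag value, twice the GPUs
on_8_gpus_doubled = per_gpu_batch(256, 8)  # doubled flag value, twice the GPUs

print(on_4_gpus, on_8_gpus_same, on_8_gpus_doubled)
```

Under that assumption, doubling the flag on 8 GPUs keeps the per-GPU batch the same as in the 4-GPU run, at the cost of a larger total batch per step.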


HenryHZY commented on August 16, 2024

@ArrowLuo Thanks for your quick reply!
Actually, I have also tested with 4 A100 GPUs. I will run the doubled batch_size experiment with 8 A100 GPUs later.

retrieval, FT-Align, 4 A100 GPUs
R@1: 0.2510 - R@5: 0.5780 - R@10: 0.7010 - Median R: 4.0
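
For readers unfamiliar with these numbers, here is a minimal sketch of how R@K and Median R are commonly computed from a query-by-candidate similarity matrix. It assumes the correct video for text query `i` sits at index `i`; the function name `retrieval_metrics` and the toy matrix are illustrative, not taken from the UniVL code:

```python
from statistics import median


def retrieval_metrics(sim):
    """sim[i][j] is the similarity of text query i to video j; video i is the true match."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the true match: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    return {
        "R@1": sum(r <= 1 for r in ranks) / n,
        "R@5": sum(r <= 5 for r in ranks) / n,
        "R@10": sum(r <= 10 for r in ranks) / n,
        "Median R": median(ranks),
    }


# Toy 3x3 example: queries 0 and 2 rank their match first, query 1 ranks it second.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.2, 0.7],
]
metrics = retrieval_metrics(sim)
print(metrics)
```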

Maybe I need to change some parameters, such as epochs, batch_size, and lr, to obtain a better result?

Do you have any other fine-tuning tips to share?
For example, as in your answer to #18, increasing the batch_size as much as my GPUs allow.


ArrowLuo commented on August 16, 2024

Hi @HenryHZY, yes, the epochs, batch_size, and lr are important for the retrieval tasks. I cannot recall other fine-tuning details or tricks now, since it has been a long time since I worked on this.


HenryHZY commented on August 16, 2024

Hi, @ArrowLuo. I would like to ask whether the input to UniVL is video-sentences, clip-sentence, or clip-sentences.

Following your instructions, I obtained the video features and text features.
Given a video_id_x that covers the time interval [0, m-1] seconds, after feature extraction, video_id_x.npy is a np.array with a shape of [m, 1024].

Suppose that video_id_x has n video clips with n corresponding sentences (defined in caption.pickle):

"video_id_x": {
    "start": [s_1, s_2, ..., s_n],
    "end": [e_1, e_2, ..., e_n],
    "text": ["t_1", "t_2", ..., "t_n"]
}

Then, what is the shape of the original input tokens to UniVL? Is it a single video clip and its one sentence?
Take the time interval [s_1, e_1] of the first video clip as an example:

video tokens: [e_1-s_1+1, 1024]
text tokens: [tokens_sum_of_t_1, word_token_embedding_size]

Are all the above data formats correct, including [m, 1024], [e_1-s_1+1, 1024] and [tokens_sum_of_t_1, word_token_embedding_size]?
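
To make the shapes in this question concrete, here is a minimal sketch of slicing per-clip features out of a whole-video feature array. It assumes one 1024-dimensional feature per second and treats the `end` second as inclusive; whether UniVL's data loader uses the same convention is not confirmed in this thread. The helper `clip_features`, the zero-filled array, and the start/end values are hypothetical stand-ins:

```python
import numpy as np

FEATURE_DIM = 1024  # per-second feature size, as stated in the question


def clip_features(video_features, start, end):
    """Return the rows for seconds start..end (inclusive end is an assumption)."""
    if not (0 <= start <= end < video_features.shape[0]):
        raise ValueError("clip interval falls outside the video")
    return video_features[start:end + 1]


m = 60  # hypothetical video length in seconds
# Stand-in for np.load("video_id_x.npy"), which would have shape [m, 1024].
video_features = np.zeros((m, FEATURE_DIM), dtype=np.float32)

# Hypothetical caption entry in the same layout as caption.pickle above.
caption = {"start": [0, 12], "end": [9, 30], "text": ["t_1", "t_2"]}

clips = [
    clip_features(video_features, s, e)
    for s, e in zip(caption["start"], caption["end"])
]
shapes = [clip.shape for clip in clips]
print(shapes)
```

Each clip then has shape [e_i - s_i + 1, 1024], matching the video-token shape proposed in the question under the inclusive-end assumption.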

Thanks for your time!

