Comments (8)
Hi, I guess your results are obtained on YouCookII. If you run directly on youcookii_data.no_transcript.pickle, the scores you got are correct given your hyperparameters. Our best scores are generated with the transcript; youcookii_data.no_transcript.pickle is a version without the transcript. See the readme file for the information below:
If using video only as input (youcookii_data.no_transcript.pickle), the results are close to:
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725
You should compare your scores with the third line from the bottom of Table 3 in our paper.
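To double-check which data file you are actually training on, a quick look inside the pickle can help. This is a minimal sketch, not code from the repo; only the file name comes from the readme, and the dict-of-dicts layout assumed here is a guess:

```python
import pickle

# Peek into the YouCookII data pickle to see whether transcript fields
# are present. The structure printed below depends on how the file was
# built; adjust to whatever pickle.load actually returns.
with open("youcookii_data.no_transcript.pickle", "rb") as f:
    data = pickle.load(f)

print(type(data), len(data) if hasattr(data, "__len__") else "")
if isinstance(data, dict):
    key = next(iter(data))
    entry = data[key]
    print(key, list(entry.keys()) if isinstance(entry, dict) else type(entry))
```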
Hi, @ArrowLuo, thanks for the information. In fact, in Table 3 of the paper, the scores for single-V input are: B-3: 16.46, B-4: 11.17, M: 17.57, R-L: 40.09, CIDEr: 1.27. These are much larger than BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725. So may I ask whether the above scores were obtained with a better hyperparameter setting or something else?
Sorry for the confusion. These metrics are printed as real values and reported as percentages in the paper (except CIDEr). So your scores are right.
Hi, @ArrowLuo, so the metrics printed by the program are correct. Then may I ask: are all the metrics in Table 3 of the paper normalized, meaning the metrics for all the models involved in the table? And how do you normalize them? I cannot get a sense of the performance from these pre-normalized metrics (BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049), since I want to compare with those in the paper. Many thanks!
There is no normalization operation; just multiply these metrics by 100 (except for CIDEr).
For example, the print is:
BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117, METEOR: 0.1769, ROUGE_L: 0.4049
and the scores after multiplying by 100 are:
BLEU_1: 39.21, BLEU_2: 25.22, BLEU_3: 16.55, BLEU_4: 11.17, METEOR: 17.69, ROUGE_L: 40.49
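For reference, the same conversion as a short Python snippet; the dictionary and variable names here are illustrative, not from the UniVL codebase:

```python
# Convert raw captioning metrics (as printed by the program) to the
# percentage form reported in the paper; CIDEr is left unscaled.
raw_scores = {
    "BLEU_1": 0.3921, "BLEU_2": 0.2522, "BLEU_3": 0.1655, "BLEU_4": 0.1117,
    "METEOR": 0.1769, "ROUGE_L": 0.4049, "CIDEr": 1.2725,
}
paper_scores = {
    name: value if name == "CIDEr" else round(value * 100, 2)
    for name, value in raw_scores.items()
}
print(paper_scores)
# {'BLEU_1': 39.21, 'BLEU_2': 25.22, 'BLEU_3': 16.55, 'BLEU_4': 11.17,
#  'METEOR': 17.69, 'ROUGE_L': 40.49, 'CIDEr': 1.2725}
```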
Hi, @ArrowLuo, many thanks, got it! However, I found a mismatch between the metric scores in the UniVL paper and those in the original "End-to-End Dense Video Captioning with Masked Transformer" paper. In their paper, the scores are:
| Method | B4 (GT Proposals) | M (GT Proposals) | B4 (Learned Proposals) | M (Learned Proposals) |
|---|---|---|---|---|
| Bi-LSTM + TempoAttn | 0.87 | 8.15 | 0.08 | 4.62 |
| Our Method | 1.42 | 11.20 | 0.30 | 6.58 |
Meanwhile, in Table 3 of the UniVL paper, the E2E masked transformer has B4: 4.38 and M: 11.55. Was this result obtained from your own experiments based on their released model, using ground-truth proposals during inference?
This part of the baseline results was copied from Table 4 in https://arxiv.org/pdf/1906.05743.pdf. I notice it is indeed different from the original paper.
Ok, I see, thanks!
Related Issues (20)
- How to fine-tune with additional layers before UniVL? HOT 2
- Run Without Distributed HOT 3
- TypeError: bad operand type for unary -: 'list' HOT 6
- How to run captioning task on my own video datasets? HOT 1
- Pre-training acceleration using multi-machine distributed training HOT 1
- Can you share your HowTo100M.csv file? HOT 3
- This repo is missing important files HOT 1
- Unable to run video captioning code HOT 3
- where to get transcript to generate youcookii_data.pickle HOT 2
- end-to-end video file captioning process HOT 3
- feature & data shape HOT 6
- How can I create my video feature pickle HOT 4
- video only test for youcook HOT 2
- How to only input text feature or video feature HOT 2
- Is there a code for Finetune on CMU-MOSI here? HOT 1
- Issues about Freezing some additional layers instead of meanP in CLIP4Clip HOT 2
- Error message (torch.distributed.elastic.multiprocessing.errors.ChildFailedError:)
- Estimate of zero-shot performance HOT 1
- Zero score (every output is None) on evaluation captioning with pretrained model HOT 1
- Non-Configurable GPU Count via Arguments