Extremely low zero-shot performance (0% acc on both val and test) on RefCOCOg (vl-t5, open issue)

j-min commented on August 22, 2024

Extremely low zero-shot performance (0% acc on both val and test) on RefCOCOg


Comments (1)

yiranyyu commented on August 22, 2024

I downloaded the model weights pre-trained on VG & COCO and the pre-processed features, following the instructions in the README. I then tested the zero-shot grounding performance of VL-T5 on the RefCOCOg dataset following the guidance. However, the accuracy on both the val and test splits is zero, which really confuses me.

I then tested the few-shot performance of VL-T5 and got a reasonable result (44.53% acc on the val split with four samples). Could the weights that were not used when initializing RefCOCOModel from the pre-trained checkpoint (see the log below) be the cause of such a big gap between zero-shot and few-shot performance?

Command to Reproduce the Results

cd VL-T5/

# edit scripts/RefCOCOg_VLT5.sh: set the `lr` param to 0 and `epochs` to 1
vim scripts/RefCOCOg_VLT5.sh

# edit line 304 of src/refcoco.py: change `>` to `>=` so the zero-accuracy checkpoint is saved for testing
vim src/refcoco.py

# run the training script on 4 GPUs (already inside VL-T5/ from the first step)
bash scripts/RefCOCOg_VLT5.sh 4
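
To illustrate why the `>` to `>=` change in the steps above is needed, here is a minimal sketch of a "save the best checkpoint" loop. It is not the actual code from src/refcoco.py: the function name `epochs_saved`, the `inclusive` switch, and the initial best accuracy of 0.0 are all assumptions made for illustration.

```python
def epochs_saved(val_accs, inclusive):
    """Return the epoch indices at which a 'best' checkpoint would be saved.

    Hypothetical helper: assumes the best validation accuracy starts at 0.0
    and that a checkpoint is written whenever the comparison succeeds.
    """
    best = 0.0  # assumed initial value of the best validation accuracy
    saved = []
    for epoch, acc in enumerate(val_accs):
        # inclusive=False models `acc > best`; inclusive=True models `acc >= best`
        improved = acc >= best if inclusive else acc > best
        if improved:
            best = acc
            saved.append(epoch)
    return saved


# One epoch with 0% validation accuracy, as in the zero-shot run.
strict_saves = epochs_saved([0.0], inclusive=False)
relaxed_saves = epochs_saved([0.0], inclusive=True)
print(strict_saves, relaxed_saves)
```

Under these assumptions, the strict comparison never writes a checkpoint when accuracy stays at zero, so there would be nothing to load for the test split; the inclusive comparison saves one at the first epoch.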

Logs and Other Information

Log

Building Model at GPU 0
Building Model at GPU 3
Building Model at GPU 1
Building Model at GPU 2
Some weights of VLT5RefCOCO were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight', 'encoder.visual_embedding.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Launching at GPU 3
Model Launching at GPU 1
Model Launching at GPU 2
Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight'])
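
The `_IncompatibleKeys` line above is what PyTorch returns from `load_state_dict` when it is called with `strict=False`. As a rough sketch of how to read it, the snippet below reproduces the same set difference in plain Python. The key lists are only the subset of names visible in the log, and `diff_state_dict_keys` is a hypothetical helper, not part of PyTorch or VL-T5.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Mimic the missing/unexpected key report of a non-strict state-dict load."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # in the model, absent from the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)  # in the checkpoint, absent from the model
    return missing, unexpected


prefix = "encoder.visual_embedding."

# Partial list of model parameters, taken from the names printed in the log.
model_keys = [
    prefix + "feat_embedding.0.weight",
    prefix + "feat_embedding.0.bias",
    prefix + "absolute_vis_pos_embedding.0.weight",
    prefix + "absolute_vis_pos_embedding.0.bias",
    prefix + "layer_norm.weight",
]

# The log reports missing_keys=[], so the checkpoint covers every model key
# and additionally holds the two keys reported as unexpected.
checkpoint_keys = model_keys + [
    prefix + "feat_embedding.1.weight",
    prefix + "absolute_vis_pos_embedding.1.weight",
]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

Read this way, an empty `missing_keys` list means every parameter the model defines received a value from the checkpoint; the two unexpected entries are extra tensors in the checkpoint that the instantiated model has no slot for.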

Script

Content of scripts/RefCOCOg_VLT5.sh (only lr and epochs params changed):

# The name of experiment
name=VLT5

output=snap/refcocog/$name

PYTHONPATH=$PYTHONPATH:./src \
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    src/refcoco.py \
        --distributed --multiGPU \
        --train train \
        --valid val \
        --test test \
        --optim adamw \
        --warmup_ratio 0.1 \
        --clip_grad_norm 5 \
        --lr 0e-5 \
        --epochs 1 \
        --num_workers 4 \
        --backbone 't5-base' \
        --output $output ${@:2} \
        --load snap/pretrain/VLT5/Epoch30 \
        --batch_size 90 \

Platform

OS: Ubuntu
GPU: A100

Update:

It seems the unexpected_keys warning is not the cause of the low performance. The unexpected_keys message disappears when I use the model further pre-trained on VCR, but the val and test performance is still low (about 0.6% on both val and test). We then tried constraining the decoding to generate only vis_extra_id_ tokens, which resulted in 1% accuracy on test.
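
For readers unfamiliar with constrained decoding, the idea is to mask out every vocabulary entry except the allowed ones before picking the next token. The following is a toy sketch of that masking step only, in plain Python; it is not the code used in this experiment, and the logits, the token ids in `region_token_ids`, and both helper names are made up for illustration.

```python
import math


def constrain_logits(logits, allowed_ids):
    """Set the score of every token outside `allowed_ids` to -inf."""
    allowed = set(allowed_ids)
    return [score if i in allowed else -math.inf for i, score in enumerate(logits)]


def greedy_pick(logits, allowed_ids):
    """Greedy decoding step restricted to the allowed token ids."""
    masked = constrain_logits(logits, allowed_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Toy vocabulary of five tokens; ids 2 and 4 stand in for vis_extra_id_ tokens
# (hypothetical ids, chosen only for this example).
toy_logits = [2.5, 0.1, 1.7, -0.3, 0.9]
region_token_ids = [2, 4]

unconstrained = max(range(len(toy_logits)), key=toy_logits.__getitem__)
constrained = greedy_pick(toy_logits, region_token_ids)
print(unconstrained, constrained)
```

In this toy case the unconstrained choice is a non-region token, while the constrained choice is forced onto the best-scoring region token. Forcing the output format this way does not by itself make the chosen region correct, which is consistent with the accuracy staying low.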
