j-min / vl-t5

PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)

Home Page: https://arxiv.org/abs/2102.02779

License: MIT License

Python 74.20% Shell 1.29% Jupyter Notebook 24.51%
vision-and-language pretraining transformers vl-t5 vl-bart

vl-t5's Introduction

Unifying Vision-and-Language Tasks via Text Generation

teaser image

Setup

# Create python environment (optional)
conda create -n vlt5 python=3.7
source activate vlt5

# Install python dependencies
pip install -r requirements.txt

# Download T5/BART backbone checkpoint
python download_backbones.py

# For MSCOCO captioning evaluation (optional; for captioning only)
python -c "import language_evaluation; language_evaluation.download('coco')"

Code structure

# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr/
        images/
        features/
    RefCOCO/

    ...

# Run feature extraction
./feature_extraction

# Train VL-T5
./VL-T5/
    src/
        modeling_t5.py modeling_bart.py                       <= VL-T5/VL-BART model classes
        pretrain.py, pretrain_data.py, pretrain_model.py      <= pretraining
        vqa.py, vqa_data.py, vqa_model.py ...                 <= fine-tuning on downstream tasks (e.g., VQA, GQA, NLVR2)
        multitask.py, multitask_data.py, multitask_model.py   <= multitask learning on 7 downstream tasks
        param.py                                              <= (argparse) configuration
        tokenization.py                                       <= custom tokenizer
        utils.py, dist_utils.py                               <= utility functions
    snap/                                                     <= store weight checkpoints
    scripts/                                                  <= bash scripts for pretraining and finetuning

API

import sys
sys.path.append('./VL-T5/src')

# Parse configuration
from param import parse_args
args = parse_args(
    backbone='t5-base',                    # Backbone architecture
    load='./snap/pretrain/VLT5/Epoch30',   # Pretrained checkpoint
    parse=False,                           # False for interactive environments (e.g., Jupyter)
)
# Assign GPU
args.gpu = 0

# Load data loaders
from vqa_data import get_loader
train_loader = get_loader(
    args,
    split=args.train,
    ...
)
val_loader = get_loader(
    args,
    split=args.valid,
    ...
)
test_loader = get_loader(
    args,
    split=args.test,
    ...
)

# Import trainer
from vqa import Trainer
trainer = Trainer(
    args,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
)

# model is attached to trainer
model = trainer.model

# Each task-specific model class is inherited from VLT5/VLBart classes, which are inherited from Huggingface transformers T5/BART classes
print(model)
>>> VLT5VQA(
    (shared): Embedding(...)
    (encoder): JointEncoder(...)
    ...
)

# Training
train_batch = next(iter(train_loader))
model.train_step(train_batch)
>>> {'loss': ... }

# Inference
test_batch = next(iter(test_loader))
model.test_step(test_batch)
>>> {'pred_ans': ... }

To add a new task, you can start by writing three files, adapting them from the existing ones; a minimal skeleton is sketched after the list below.

NEW_TASK_model.py # Define a VLT5NewTask/VLBartNewTask model which inherits VLT5/VLBart class
NEW_TASK_data.py # Define Dataset/DataLoader/Evaluator
NEW_TASK.py # Define a trainer which inherits TrainerBase (trainer_base.py)
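
A minimal, hypothetical skeleton of the model file. The class, method, and batch-key names below are assumptions copied from the pattern shown in the API section above; check an existing task such as vqa_model.py / vqa.py for the exact interfaces.

# NEW_TASK_model.py -- hedged sketch; mirror vqa_model.py for the real interfaces
import torch

from modeling_t5 import VLT5  # assumes ./VL-T5/src is on sys.path


class VLT5NewTask(VLT5):
    def train_step(self, batch):
        # Batch keys (input_ids, vis_feats, boxes, target_ids) are assumptions;
        # copy the exact names from the collate_fn of an existing *_data.py.
        device = next(self.parameters()).device
        output = self(
            input_ids=batch['input_ids'].to(device),
            vis_inputs=(batch['vis_feats'].to(device), batch['boxes'].to(device)),
            labels=batch['target_ids'].to(device),
            return_dict=True,
        )
        return {'loss': output['loss']}

    @torch.no_grad()
    def test_step(self, batch, **kwargs):
        device = next(self.parameters()).device
        generated = self.generate(
            batch['input_ids'].to(device),
            vis_inputs=(batch['vis_feats'].to(device), batch['boxes'].to(device)),
            **kwargs,
        )
        # self.tokenizer is assumed to be attached by the trainer (see trainer_base.py).
        return {'pred': self.tokenizer.batch_decode(generated, skip_special_tokens=True)}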

Download Pre-trained models / Pre-extracted features

We host model checkpoints and features on Google Drive. We recommend using gdrive to download them.

Pretrained Models

gdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive

COCO+VG pretraining (default)

  • VL-T5/snap/pretrain/VLT5/Epoch30.pth: VL-T5 pretrained for 30 epochs on COCO+VG
  • VL-T5/snap/pretrain/VLBart/Epoch30.pth: VL-BART pretrained for 30 epochs on COCO+VG

VCR pretraining (2nd stage)

  • VL-T5/snap/vcr_pretrain/VLT5/Epoch20.pth: VL-T5 further pretrained for 20 epochs on VCR
  • VL-T5/snap/vcr_pretrain/VLBart/Epoch20.pth: VL-BART further pretrained for 20 epochs on VCR

Dataset Preparation / Feature extraction

gdrive download 1MBBhlkP83VMKS2Qe0SmFfzkHhMpIG5wf --recursive
  • Multi30K only
    • git clone --recursive https://github.com/multi30k/dataset ./datasets/multi30k-dataset
    • gunzip train.en.gz, val.en.gz, test_2017_flickr.en.gz, test_2018_flickr.en.gz in ./datasets/multi30k-dataset/data/task1/raw/
    • gunzip train.de.gz, val.de.gz, test_2017_flickr.de.gz, test_2018_flickr.de.gz in ./datasets/multi30k-dataset/data/task1/raw/ (a Python alternative is sketched after this list)
  • For manual feature extraction, please checkout ./feature_extraction
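
If gunzip is not available, the same decompression can be done in Python. A small sketch; the file list follows the bullets above.

# Decompress the Multi30K raw text files listed above (sketch).
import gzip
import shutil
from pathlib import Path

raw_dir = Path('./datasets/multi30k-dataset/data/task1/raw')

for split in ['train', 'val', 'test_2017_flickr', 'test_2018_flickr']:
    for lang in ['en', 'de']:
        gz_path = raw_dir / f'{split}.{lang}.gz'
        out_path = raw_dir / f'{split}.{lang}'
        with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
        print(f'Wrote {out_path}')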

Pretraining on COCO+VG

# Pretraining with 4 gpus
cd VL-T5/
bash scripts/COCOVG_pretrain_VLT5.sh 4
bash scripts/COCOVG_pretrain_VLBart.sh 4

Downstream tasks

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/VQA_VLT5.sh 4
bash scripts/VQA_VLBart.sh 4
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/GQA_VLT5.sh 4
bash scripts/GQA_VLBart.sh 4
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/NLVR_VLT5.sh 4
bash scripts/NLVR_VLBart.sh 4
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/RefCOCOg_VLT5.sh 4
bash scripts/RefCOCOg_VLBart.sh 4
# Pretraining on VCR with 4 gpus (optional)
cd VL-T5/
bash scripts/VCR_pretrain_VLT5.sh 4
bash scripts/VCR_pretrain_VLBart.sh 4

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/VCR_VLT5.sh 4
bash scripts/VCR_VLBart.sh 4
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/COCOCaption_VLT5.sh 4
bash scripts/COCOCaption_VLBart.sh 4
# Finetuning with 4 gpus
cd VL-T5/
bash scripts/Multi30K_VLT5.sh 4
bash scripts/Multi30K_VLBart.sh 4

Reference

Please cite our paper if you use our models in your work:

@inproceedings{cho2021vlt5,
  title     = {Unifying Vision-and-Language Tasks via Text Generation},
  author    = {Jaemin Cho and Jie Lei and Hao Tan and Mohit Bansal},
  booktitle = {ICML},
  year      = {2021}
}

vl-t5's People

Contributors

chenxwh, j-min, jnhwkim


vl-t5's Issues

How to complete Multi-task Finetuning

Hi Jaemin,

Thanks for the very interesting paper.

I didn't see how multi-task finetuning is handled in the code. I'm not sure whether I missed part of your code or whether the relevant code hasn't been released.

Error in provided Colab file

Hello,

When I run the Colab file that you provided, I get this error:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'T5Tokenizer'. The class this function is called from is 'VLT5TokenizerFast'.

TypeError                                 Traceback (most recent call last)
<ipython-input-30-03619949772e> in <module>()
----> 1 tokenizer = VLT5TokenizerFast.from_pretrained('t5-base')

4 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils.py in __init__(self, **kwargs)
    338
    339     def __init__(self, **kwargs):
--> 340         super().__init__(**kwargs)
    341
    342         # Added tokens - We store this for both slow and fast tokenizers

TypeError: super(type, obj): obj must be an instance or subtype of type

For the line:

tokenizer = VLT5TokenizerFast.from_pretrained('t5-base')

How can I solve this? Thanks already!

transformers version-up

I noticed that the current ipynb demo fails when installing the packages in requirements.txt on Colab with Python 3.10. It seems that tokenizers==0.9.4, which is paired with transformers==4.2.1, is not compatible with Python 3.10 on Colab.

Since changing the Python version on Colab is not trivial, I'm now considering package version-ups so that people can use VL-T5 with recent transformers versions (>= 4.31). This might also address tokenization issue #21, which is relevant to the recent transformers versions.

Current todos are:

  • update modeling_t5.py
    • T5Stack variable version ups (e.g., use_cache=None -> False / head_mask -> layer_head_mask)
  • check tokenization.py
    • add some test cases comparing tokenization outputs from the old 4.2.1 version and 4.31.0 (one possible shape is sketched after this list)
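
One possible shape for such a test (a sketch, not code from this repo): record golden tokenizations once under transformers==4.2.1, then re-run the same script under 4.31.0 and compare. The sample strings below are only illustrative.

# tokenization_regression.py -- sketch of a cross-version tokenization check.
import json
from pathlib import Path

from tokenization import VLT5TokenizerFast  # this repo's custom tokenizer

GOLDEN = Path('tokenization_golden.json')
SAMPLES = [
    "I <extra_id_0> you.",
    "vqa: what color is the <extra_id_0>?",
]

tokenizer = VLT5TokenizerFast.from_pretrained('t5-base')
outputs = {s: tokenizer.encode(s) for s in SAMPLES}

if not GOLDEN.exists():
    # First run (under transformers==4.2.1): record golden outputs.
    GOLDEN.write_text(json.dumps(outputs, indent=2))
    print(f'Recorded golden tokenizations to {GOLDEN}')
else:
    # Second run (under transformers==4.31.0): compare against the recording.
    golden = json.loads(GOLDEN.read_text())
    for s in SAMPLES:
        assert outputs[s] == golden[s], f'Mismatch for {s!r}: {outputs[s]} vs {golden[s]}'
    print('Tokenization matches the recorded golden outputs.')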

Checkpoint on COCO captioning pretraining

Hi Jaemin,
Thanks for your great work! Both the code and the paper inspire me a lot. I'd like to run a small test and would need the checkpoint finetuned for COCO captioning. Would you mind providing it? Thanks.

Some weights of VLT5VQA were not initialized from the model checkpoint at t5-base and are newly initialized

I'm running your VQA model on Google colab and I seem to get an error when loading the model weights:

Building Model at GPU 0
Some weights of VLT5VQA were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.layer_norm.weight'])
Model Launching at GPU 0
It took 3.0s

EDIT: This also happens when using the VLT5Pretraining:

Some weights of VLT5Pretraining were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight']
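
These warnings are largely expected: the visual-embedding modules do not exist in the plain t5-base checkpoint and are only filled in when Epoch30.pth is loaded afterwards (the log above shows that load). A quick, generic way to check which parameters the .pth file actually covers is sketched below, assuming the file is a raw state_dict; unwrap it first if it is nested.

# Compare the keys in the pretrained VL-T5 checkpoint with the model's state dict (sketch).
import torch

state_dict = torch.load('snap/pretrain/VLT5/Epoch30.pth', map_location='cpu')

model_keys = set(model.state_dict().keys())  # `model` as built in the API example above
ckpt_keys = set(state_dict.keys())

print('In model but missing from checkpoint:', sorted(model_keys - ckpt_keys))
print('In checkpoint but missing from model:', sorted(ckpt_keys - model_keys))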

Text Generation Maximum Length

Thank you so much for this repo! It has been a pleasure to work with.

I am setting up a chart captioning finetuning task. My dataset contains pairs of chart images and chart scenegraphs (textual representations of the chart spec). I also have ground truth natural language captions.

I have finetuned your pretrained VLT5 model on my data. It is generating informative captions, but the generated captions are much shorter than the ground truth captions. The ground truth captions are on average 450 characters, whereas the generated captions are on average 181 characters.

Would you expect VLT5 to prefer short captions (i.e., because it was pretrained on short text)? Or would you expect I have a parameter set incorrectly? I have set gen_max_length = 512 and max_text_length = 512.
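
For reference, output length is usually easier to steer with the standard Hugging Face generation arguments than with the maximum length alone. A hedged sketch with plain t5-base is below; whether and how VL-T5's test_step forwards these kwargs should be checked in param.py and the task model.

# Generic Hugging Face generation-length knobs (sketch with plain t5-base, not VL-T5).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

inputs = tokenizer('caption: <a chart scenegraph would go here>', return_tensors='pt')
output_ids = model.generate(
    inputs.input_ids,
    max_length=512,         # upper bound only; does not make outputs longer by itself
    min_length=64,          # forces generation past this many tokens
    num_beams=5,
    length_penalty=1.5,     # > 1.0 favours longer beams
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))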

About the CPU memory leak

Hi Jaemin,

Thanks for your interesting work.
I have been working with your codebase on the REG (referring expression generation) task and modified the finetuning code to fit it, which works well. Recently, however, I noticed that CPU memory usage keeps growing during training. Finetuning for 20 epochs with 4 RTX 2080 GPUs and 32 CPUs, it occupies about 26 GB at the start and increases to 90 GB by epoch 15, which causes the error:
RuntimeError: DataLoader worker (pid 28449) is killed by signal: Killed.
I hadn't hit this problem before, but I think it already existed; the CPU memory usage just never exceeded 90 GB, so I didn't notice it. Have you encountered this problem? Do you have any suggestions?
I would really appreciate a reply as soon as possible!
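
One way to narrow this down is to log the resident memory of the main process and its DataLoader workers during training; if it grows steadily per epoch, the leak is usually in objects accumulated inside the Dataset or the training loop. A generic diagnostic sketch (requires psutil, which is not part of this repo's requirements):

# Log resident memory of the training process and its DataLoader workers (generic sketch).
import psutil


def log_memory(step):
    proc = psutil.Process()
    main_rss = proc.memory_info().rss
    worker_rss = sum(c.memory_info().rss for c in proc.children(recursive=True))
    print(f'step {step}: main {main_rss / 1e9:.2f} GB, workers {worker_rss / 1e9:.2f} GB')


# Example: call log_memory(step) every few hundred steps inside the training loop.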

Low test accuracy on Refcocog

Dear authors,
Thanks for your great work!
I fine-tuned a VL-T5 model on RefCOCOg, based on your released checkpoint, but got a bad result: 20.45407.
I noticed that in your paper it is VL-BART that diverges, not VL-T5.
Could you help take a look?

Thanks!

Generator-based model work better than the classifier-based model?

Hi!
First of all, thanks for such great work :)

I just read your reply on #1; it is interesting.

you won't need to modify the vocabulary since T5's tokenizer is based on sentencepiece trained on a large corpus.
So I have a question:
I was wondering whether the generator-based model works better than the classifier-based model (e.g., on the VQA task)?
What do you think about this view?

Can anyone provide the Karpathy in-domain and out-of-domain data splits?

Hello,

I downloaded the Karpathy split and found that the test set appears to be divided the way your paper describes: the top-k questions are placed in-domain and the rest are out-of-domain (26,280 test questions = 24,722 in-domain + 1,558 out-of-domain). Is this split available? Which questions make up the 24,722 and which make up the 1,558? Could you provide them?

If you can provide the split above, could I also ask how the accuracy reported in the paper was obtained? I would like to reproduce the in-domain and out-of-domain results in your paper.

Is the model trained only on the training set (605,102 samples), with the saved weights then used for inference on the test set, or is it trained on train+val (605,102 + 26,729) before saving the weights and running inference on the test set?

I am looking forward to your reply.

Error when finetuning RefCOCO with BART

Hi, when running bash scripts/RefCOCOg_VLBart.sh 1,
I got the following error:

Original Traceback (most recent call last):
  File "/home/zuujhyt/miniconda3/envs/vlt5/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zuujhyt/miniconda3/envs/vlt5/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zuujhyt/miniconda3/envs/vlt5/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/zuujhyt/test/VL-T5/VL-T5/src/refcoco_data.py", line 243, in __getitem__
    if self.args.vis_pointer:
AttributeError: 'Config' object has no attribute 'vis_pointer'

Any advice?
Thanks.

[Errno 32] Broken pipe

Hello,

I got the following error when I ran bash scripts/VCR_pretrain_VLT5.sh 4:

[screenshot of the error]

I couldn't solve the problem and need your help!

Thanks!

Extremely low zero-shot performance (0% acc on both val and test) on RefCOCOg

I downloaded the model weights pretrained on VG+COCO and the pre-processed features following the instructions in the README, then tested the zero-shot grounding performance of VL-T5 on the RefCOCOg dataset following the guidance. However, the accuracy on both the val and test splits is zero, which really confuses me.

Then I tested the few-shot performance of VL-T5 and got a reasonable result (44.53% accuracy on the val split with four samples). I was wondering whether the weights that are not used when initializing the RefCOCO model from the pretrained weights (see the log below) cause such a big gap between the zero-shot and few-shot performance.

Command to Reproduce the Results

cd VL-T5/

# modify scripts/RefCOCOg_VLT5.sh to set the `lr` param to 0, set epoch to 1
vim scripts/RefCOCOg_VLT5.sh

# modify line 304 of src/refcoco.py, changing `>` to `>=`, to save the zero-acc checkpoint for testing
vim src/refcoco.py

# run the training script
cd VL-T5/
bash scripts/RefCOCOg_VLT5.sh 4

Logs and Other Information

Log

Building Model at GPU 0
Building Model at GPU 3
Building Model at GPU 1
Building Model at GPU 2
Some weights of VLT5RefCOCO were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.visual_embedding.feat_embedding.0.weight', 'encoder.visual_embedding.feat_embedding.0.bias', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.0.bias', 'encoder.visual_embedding.obj_order_embedding.weight', 'encoder.visual_embedding.img_order_embedding.weight', 'encoder.visual_embedding.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Launching at GPU 3
Model Launching at GPU 1
Model Launching at GPU 2
Model loaded from  snap/pretrain/VLT5/Epoch30.pth
_IncompatibleKeys(missing_keys=[], unexpected_keys=['encoder.visual_embedding.feat_embedding.1.weight', 'encoder.visual_embedding.absolute_vis_pos_embedding.1.weight'])


Script

Content of scripts/RefCOCOg_VLT5.sh (only lr and epochs params changed):

# The name of experiment
name=VLT5

output=snap/refcocog/$name

PYTHONPATH=$PYTHONPATH:./src \
python -m torch.distributed.launch \
    --nproc_per_node=$1 \
    src/refcoco.py \
        --distributed --multiGPU \
        --train train \
        --valid val \
        --test test \
        --optim adamw \
        --warmup_ratio 0.1 \
        --clip_grad_norm 5 \
        --lr 0e-5 \
        --epochs 1 \
        --num_workers 4 \
        --backbone 't5-base' \
        --output $output ${@:2} \
        --load snap/pretrain/VLT5/Epoch30 \
        --batch_size 90 \

Platform

OS: Ubuntu
GPU: A100

Inference on my own data?

Hello! First of all, thank you so much for your work. I have read your paper and I want to carry out some open-ended VQA / answer-generation experiments with the model you proposed (VL-T5). However, I am unsure where to start with the provided code. Would it be possible for you to provide example code for extracting image features and text features for a custom dataset (not the data in VQA v2.0)? I want to test whether it can generate answers based on images and questions that I have prepared.

Thank you so much, and I am sorry for troubling you.
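
Not an official answer, but the general shape is: extract region features with the code in ./feature_extraction, store them per image, and then point a copy of an existing *_data.py at the new file. The HDF5 layout and key names in the sketch below are guesses and must be checked against the dataset class you reuse.

# Hypothetical packaging of pre-extracted region features for a custom dataset.
# The group/dataset names ('features', 'boxes') are assumptions -- verify them
# against ./feature_extraction and the *_data.py loader you plan to reuse.
import h5py
import numpy as np

num_boxes, feat_dim = 36, 2048  # typical Faster R-CNN region-feature shape (assumption)

with h5py.File('./datasets/custom/features/custom_boxes36.h5', 'w') as f:
    for img_id in ['my_image_0001', 'my_image_0002']:
        feats = np.random.randn(num_boxes, feat_dim).astype('float32')  # replace with real features
        boxes = np.random.rand(num_boxes, 4).astype('float32')          # replace with real (x1, y1, x2, y2)
        grp = f.create_group(img_id)
        grp.create_dataset('features', data=feats)
        grp.create_dataset('boxes', data=boxes)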

Captioning off-the-shelf

Hello,

I am using pre-trained VL-T5 to generate captions for Flickr30K images off the shelf, i.e. without any finetuning. I modified the captioning scripts to predict directly. However, I observe very short captions, almost like noun phrases. I am including some examples below. I have played with the '--gen_max_length' and '--num_beams' parameters, but I still get very short outputs. Do you have any idea why this may be happening, or any suggestions for how to generate longer captions?

Thank you in advance!
Shruti

purple shirt
cutting cake
smiling
large group of people
skier

Hyperparameter Tuning Strategies

Hi Jaemin,

Thanks for the very interesting paper and releasing your codebase!

I have been working with your codebase for a different multimodal text generation task and observe lower performance with VL-T5 and VL-BART than other similar models. I think this might be a hyperparameter tuning issue. Do you have any advice on which particular parameters might be beneficial to tune? I am currently following the Multi30K settings for the learning rate and number of epochs from Table 14 in your paper.

a bug of VLT5TokenizerFast

When I use VLT5TokenizerFast to encode a sentence, a token with id 3 ('▁') appears before the id of the <extra_id_i> token. For example:

from tokenization import VLT5Tokenizer, VLT5TokenizerFast

tokenizer = VLT5TokenizerFast.from_pretrained(
            't5-base',
            max_length=20,
            do_lower_case=False,
            )

text = "I <extra_id_0> you."
input_ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(input_ids)
print(text)
print(input_ids)
print(decoded_text)
print(tokenizer.convert_ids_to_tokens([3]))


(base) zhangjiajie@node2:~/VL-T5-Incontext/VL-T5-Incontext/src$ python test.py
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'T5Tokenizer'. 
The class this function is called from is 'VLT5TokenizerFast'.
I <extra_id_0> you.
[27, 3, 32099, 25, 5, 1]
I <extra_id_0> you.</s>
['▁']


If I just use T5TokenizerFast, it works as expected, and the output is:

(base) zhangjiajie@node2:~/VL-T5-Incontext/VL-T5-Incontext/src$ python test.py
I <extra_id_0> you.
[27, 32099, 25, 5, 1]
I<extra_id_0> you.</s>
['▁']

Is there any solution? Thanks!
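
Until the fast tokenizer is fixed, one workaround is to use the slow VLT5Tokenizer, or to drop the stray '▁' (id 3) that VLT5TokenizerFast inserts right before a sentinel token. A small sketch; the t5-base sentinel-id range 32000-32099 is an assumption.

# Drop a lone '▁' (id 3) that immediately precedes an <extra_id_*> sentinel token (sketch).
def strip_spurious_space(input_ids, space_id=3, sentinel_ids=range(32000, 32100)):
    cleaned = []
    for i, tok in enumerate(input_ids):
        if tok == space_id and i + 1 < len(input_ids) and input_ids[i + 1] in sentinel_ids:
            continue  # skip the spurious '▁' inserted before the sentinel
        cleaned.append(tok)
    return cleaned


print(strip_spurious_space([27, 3, 32099, 25, 5, 1]))
# [27, 32099, 25, 5, 1] -- matches the T5TokenizerFast output shown above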

In domain and out of domain performance results from which datasets?

Dear authors,

I see that you refer to the Karpathy VQA test split.
It has three data files: train, val, and test.
https://drive.google.com/drive/folders/1ZtuS7lsh_pZofOSiErwTM7lDUwdC3HAl
For the in-domain VQA test results, which data were used for training, validation, and test?
For the out-of-domain VQA test results, which data were used for training, validation, and test?
I couldn't find the details in the paper; I only found:
[figure from the paper]

I want to reproduce the results. I hope for your reply.
