
meter's Introduction

METER: A Multimodal End-to-end TransformER Framework

Install

pip install -r requirements.txt
pip install -e .

Pre-trained Checkpoints

Here are the pre-trained models:

  1. METER-CLIP16-RoBERTa (resolution: 288^2) pre-trained on GCC+SBU+COCO+VG link
  2. METER-CLIP16-RoBERTa (resolution: 224^2) pre-trained on GCC+SBU+COCO+VG link
  3. METER-SwinBase-RoBERTa (resolution: 384^2) pre-trained on GCC+SBU+COCO+VG link
  4. METER-CLIP16-RoBERTa fine-tuned on VQAv2 (resolution: 576^2) link
  5. METER-CLIP16-RoBERTa fine-tuned on NLVR2 (resolution: 288^2) link
  6. METER-CLIP16-RoBERTa fine-tuned on SNLI-VE (resolution: 384^2) link
  7. METER-CLIP16-RoBERTa fine-tuned on Flickr30k IR/TR (resolution: 384^2) link
  8. METER-CLIP16-RoBERTa fine-tuned on COCO IR/TR (resolution: 384^2) link

Dataset Preparation

We follow ViLT and use pyarrow to serialize the datasets. See this link for details.
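
For a quick sanity check, each serialized split can be inspected with pyarrow. The sketch below is only illustrative; the file name and column names are assumptions based on the ViLT data layout and may differ from your arrow files:

import pyarrow as pa

# Hypothetical example: open one serialized split and inspect it.
# The file name and column names are assumptions, not guaranteed by this repo.
path = "<ARROW_ROOT>/coco_caption_karpathy_train.arrow"
with pa.memory_map(path, "r") as source:
    table = pa.ipc.open_file(source).read_all()

print(table.num_rows)      # number of serialized examples
print(table.schema.names)  # column names, e.g. image/caption fields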

Pre-training

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_mlm_itm_clip_bert per_gpu_batchsize=32 clip16 text_roberta image_size=288

Fine-tuning on Downstream Tasks

VQAv2

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_vqa_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> <IMAGE_AUGMENTATION>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_vqa_clip_bert per_gpu_batchsize=32 load_path=meter_pretrain.ckpt clip16 text_roberta image_size=576 clip_randaug 

Flickr30k IR/TR

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_f30k_clip_bert get_recall_metric=False per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> <IMAGE_AUGMENTATION>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_irtr_f30k_clip_bert get_recall_metric=False per_gpu_batchsize=32 load_path=meter_pretrain.ckpt clip16 text_roberta image_size=384 clip_randaug 

COCO IR/TR

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_coco_clip_bert get_recall_metric=False per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> <IMAGE_AUGMENTATION>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_irtr_coco_clip_bert get_recall_metric=False per_gpu_batchsize=32 load_path=meter_pretrain.ckpt clip16 text_roberta image_size=384 clip_randaug 

NLVR2

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES>  task_finetune_nlvr2_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> <IMAGE_AUGMENTATION>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1  task_finetune_nlvr2_clip_bert per_gpu_batchsize=32 load_path=meter_pretrain.ckpt clip16 text_roberta image_size=288 clip_randaug 

SNLI-VE

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_snli_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> <IMAGE_AUGMENTATION>

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_snli_clip_bert per_gpu_batchsize=8 load_path=meter_pretrain.ckpt clip16 text_roberta image_size=384 clip_randaug 

Evaluation on Downstream Tasks

VQAv2

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_vqa_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> test_only=True

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_vqa_clip_bert per_gpu_batchsize=32 load_path=meter_vqa.ckpt clip16 text_roberta image_size=576 test_only=True

Then, submit the JSON file in the result directory to the eval.ai evaluation server to get the test-dev and/or test-std scores.

Flickr30k IR/TR

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_f30k_clip_bert get_recall_metric=True per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> test_only=True

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_irtr_f30k_clip_bert get_recall_metric=True per_gpu_batchsize=32 load_path=meter_f30k.ckpt clip16 text_roberta image_size=384 test_only=True

The returned values are IR R@1, R@5, R@10 and TR R@1, R@5, R@10.

COCO IR/TR

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_coco_clip_bert get_recall_metric=True per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> test_only=True

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_irtr_coco_clip_bert get_recall_metric=True per_gpu_batchsize=32 load_path=meter_coco.ckpt clip16 text_roberta image_size=384 test_only=True

The returned values are IR R@1, R@5, R@10 and TR R@1, R@5, R@10.

NLVR2

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES>  task_finetune_nlvr2_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> test_only=True

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1  task_finetune_nlvr2_clip_bert per_gpu_batchsize=32 load_path=meter_nlvr2.ckpt clip16 text_roberta image_size=288 test_only=True

SNLI-VE

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_snli_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> test_only=True

Here is an example:

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_finetune_snli_clip_bert per_gpu_batchsize=8 load_path=meter_snli.ckpt clip16 text_roberta image_size=384 test_only=True

Citation

@inproceedings{dou2022meter,
  title={An Empirical Study of Training End-to-End Vision-and-Language Transformers},
  author={Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Zhang, Pengchuan and Yuan, Lu and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  url={https://arxiv.org/abs/2111.02387},
}

Acknowledgements

The code is based on ViLT, which is licensed under Apache 2.0, and some of the code is borrowed from CLIP and Swin Transformer.

meter's Issues

Why are the test results different when using the same data?

I used pl.seed_everything to set the seed,

pl.seed_everything(_config["seed"], workers=True)

but I still got different results when I tested the Flickr30k image-to-text retrieval task on a model trained by myself.
First:

(tensor(0.7382), tensor(0.9274), tensor(0.9638), tensor(0.8965), tensor(0.9814), tensor(0.9941)) 0

Second:

(tensor(0.7366), tensor(0.9294), tensor(0.9656), tensor(0.8975), tensor(0.9814), tensor(0.9941)) 0

I made sure the config files are the same.
Have you encountered this problem?

GPU memory imbalance

Hello~ thank you for your open-source code.
When I run METER on 8 A100s, I find that one GPU's memory usage is 93% while the others are at 61%. There are 7 extra processes on the first card, which results in the memory imbalance. But if we use DDP, the memory should be balanced. Have you met this problem?
I also ran ViLT, which does not have this problem.
Thank you!

MLM task without visual input

Hi, thanks for the code! According to your paper, the model is trained to reconstruct the original tokens given the
masked tokens and their corresponding visual input v in MLM. However, in the compute_mlm() function, the prediction is not conditioned on the visual inputs. I wonder whether the visual input can help the pre-training.

Is P100 okay?

Hello, thank you for your work. Can I fine-tune VQA with 4 P100s (16 GB)?

pretraining task

Hello authors, great work! I'm curious whether you have tried adding image-text contrastive (ITC) learning to the pre-training tasks. In the ALBEF paper, they reported that the ITC task had a large impact on the experimental results.

Resume pre-training using "resume_from_checkpoint" in pytorch lightning

Hi, thank you for sharing the code! I have a question about how to resume pre-training. My pre-training stopped halfway, and I want to resume it from the saved checkpoint. Have you tried resuming pre-training from a checkpoint in your experiments, and if so, what method did you use? I tried setting "resume_from=" in config.py (which sets the "resume_from_checkpoint" parameter of the PyTorch Lightning trainer), but I found that the pre-training loss started to increase instead of decrease after one epoch. Does this issue have something to do with the use of DDP? Thank you!
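
For reference, a minimal sketch of resuming from the command line, assuming the resume_from key in config.py (which the issue above says is forwarded to PyTorch Lightning's resume_from_checkpoint) can be overridden the same way as the other options in this README:

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> <IMAGE_ENCODER> <TEXT_ENCODER> image_size=<IMAGE_SIZE> resume_from=<LAST_CKPT>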

Pretrained weights of CLIP-ViT-224/32

Hi,

Thanks for the code! I wonder if you plan to release the pretrained weights of CLIP-ViT-224/32 (e.g., METER-CLIP32-RoBERTa (resolution: 224^2) pre-trained on GCC+SBU+COCO+VG)? It would be helpful for those who want to play with your model but don't have enough computational resources. Thanks!

The results of COCO IR/TR

I ran the COCO IR/TR evaluation task with the example command you provided. However, the returned values of IR R@1, R@5, R@10 and TR R@1, R@5, R@10 are not shown in the terminal. After running the evaluation, I only get an events.out.tfevents... file, which is binary.
How can I get the values of IR R@1, R@5, R@10 and TR R@1, R@5, R@10?

Image Caption Script

Hi authors, thanks for your work. Could you provide an example script for the image captioning downstream task?

Thanks.

Citation typos in paper

Found some incorrect citations in your arXiv paper; e.g., in Table 1, VL-BERT is actually [50] but is referred to as [30], and OSCAR is actually [33] but is referred to as [7].

question about whole_word_mask

Hi, it seems that whole_word_mask is unsuitable for RoBERTa, yet in your example script

python run.py with data_root=/data2/dsets/dataset num_gpus=8 num_nodes=1 task_mlm_itm_clip_bert per_gpu_batchsize=32 clip16 text_roberta image_size=288

I wonder whether there is a performance drop when setting whole_word_mask=True for text_roberta.

Question about Table 5 in the paper

Hi,
As stated in the first row of Table 5, the TR for Flickr-ZS is 90.38, which is the highest in its column but is not in bold.
Is this an oversight or the real result?

By the way, in my understanding, the text embedding in the emb-only method is obtained from the BERT embedding module without the encoder module, so it contains no contextual information. Do I understand this correctly?

ValueError and AttributeError

Hi,
I'm trying to make run.py work for pre-training, but I got a ValueError and an AttributeError, and I couldn't find a solution. Can you help me check it? Thank you very much!

Traceback (most recent call last):
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/run.py", line 238, in call
self.result = self.main_function(*args)
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 20, in main
dm = MTDataModule(_config, dist=True)
File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/multitask_datamodule.py", line 19, in init
self.dm_dicts = {key: _datamoduleskey for key in datamodule_keys}
File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/multitask_datamodule.py", line 19, in
self.dm_dicts = {key: _datamoduleskey for key in datamodule_keys}
File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/coco_caption_karpathy_datamodule.py", line 7, in init
super().init(*args, **kwargs)
File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/datamodule_base.py", line 60, in init
self.tokenizer = get_pretrained_tokenizer(tokenizer)
File "/home/T3090U3/PycharmProjects/hxf/METER/METER-main/meter/datamodules/datamodule_base.py", line 25, in get_pretrained_tokenizer
return RobertaTokenizer.from_pretrained(from_pretrained)
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1672, in from_pretrained
resolved_vocab_files[file_id] = cached_path(
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/file_utils.py", line 1271, in cached_path
output_path = get_from_cache(
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/transformers/file_utils.py", line 1494, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 16, in
def main(_config):
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/home/T3090U3/anaconda3/envs/hxf/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

Training settings when using different pre-training datasets

When I tried to reproduce the results in Table 17, I found that using the default learning rate and only the COCO pre-training dataset performed extremely poorly on downstream tasks.

So I would like to ask: do you set different training parameters (e.g., lr, batch size, max epochs) for different pre-training datasets?

training steps

Hello authors. When I read the paper, I found that 50,000 steps were trained when the encoder is CLIP-32 and 100,000 steps when the encoder is CLIP-16. Is this because CLIP-32 converges faster than CLIP-16, or just to reduce the training time?

Pre-trained models for the Merged Attention Model?

Thanks for the amazing repository; the code is really clean. If I understand correctly, the current implementation is the co-attention model, and the same goes for the pre-trained weights. I wanted to know if you have plans to release the merged attention model weights as well. Thanks in advance!

result difference

Hi, could you explain the difference between the 'RoBERTa-CLIP-16' in Table 4 and the 'METER-CLIP-ViT_BASE' in Table 7?

Performance drop when using DDP

Hi! Thanks for your great work! I tried to pre-train the model on multiple nodes with multiple GPUs (8 × 8 GPUs, as ViLT did) and observed a performance drop when fine-tuning on VQA. Is there any config difference between pre-training on a single node (1 × 8 GPUs, as the README recommends) and on multiple nodes (n × 8 GPUs)?

Unable to train models faster with more gpus

Hi,
I am facing an issue where, on increasing the number of GPUs and nodes, the number of steps per epoch does not change. For example, if I run

python run.py with data_root=/data/datasets/meter_data_combined num_gpus=4 num_nodes=8 task_mlm_itm_clip_bert per_gpu_batchsize=64 clip16 text_roberta image_size=224 precision=16 datasets='["vg"]'

the number of steps per epoch is nearly 150k. I observe that the number of steps is 150k both when num_gpus=1 num_nodes=1 and when num_gpus=4 num_nodes=8. I made sure that all GPUs were being utilized when I set num_gpus=4 num_nodes=8. I also observe that with num_gpus=4 num_nodes=8, each epoch takes ~160 hours in my case, while it takes ~30 hours with num_gpus=1 num_nodes=1.

Do you have any suggestions for this problem?

SNLI dataset download

Hello~ The ViLT GitHub repo has no SNLI dataset download or preparation details. Is there any SNLI dataset preparation guideline?

Some questions for the paper

What is the difference between the scores in Table 5 and Table 8?
77.19 in Table 5 is reported on the test-dev set of VQAv2, and
77.68 in Table 8 is also reported on the test-dev set of VQAv2.

questions about VQA

Hi, could you share the VQAv2 result when fine-tuning with an image resolution of 384? The result I got is 76.52, based on your checkpoint pre-trained on COCO, SBU, VG, and CC3M.

> Hello, you may check 'Evaluation on Downstream Tasks' in README.md and find whether it's what you want? ![image](https://user-images.githubusercontent.com/71176040/201471719-fc32a022-201d-4c9d-acee-03d39451da56.png)


Hi Ferry,
Thank you so much for your reply. Sorry that I didn't explain the task clearly.
The image captioning task is the one mentioned in your supplementary file, where the evaluation metrics are BLEU, METEOR, CIDEr, and SPICE.

Originally posted by @Markin-Wang in #32 (comment)

input size of irtr_rank_output

Dear authors,
Thanks for your excellent work! I have two questions about the IRTR task.

  1. First, I found that the output size of infer["cls_feats"] in the compute_irtr function (line 266) is twice the hidden_size, but the in_features of pl_module.rank_output equals hidden_size. How should I understand this?
  2. Second, why is token_type_embedding added to text_embeds and image_embeds in the infer function (line 203)?

GPU OOM when pretraining

Hi, I'm trying to pre-train METER on 8 A100 GPUs with the recommended config:

python run.py with num_gpus=8 num_nodes=1 task_mlm_itm_clip_bert per_gpu_batchsize=32 clip16 text_roberta image_size=288

but a GPU out-of-memory (OOM) error occurred.

So what is the exact per_gpu_batchsize? And how can I pre-train the model in about 8 days, as mentioned in the paper?

By the way, will mixed-precision training (precision=16) cause a performance drop?

Many thanks!

questions about vqa

Hi, what image_size was used for VQAv2 in Table 2 and Table 3 of your paper?

What is the per-GPU batch size?

The total batch size is 4096 and there are 8 GPUs, so is the per-GPU batch size 512?
But on an A100 GPU, I can only set the batch size to 16.
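
For what it's worth, the effective batch size of 4096 is normally reached through gradient accumulation rather than a per-GPU batch of 512. A rough sketch of the arithmetic, assuming the trainer derives the accumulation steps from the configured total batch size (as ViLT does); the variable names here are illustrative:

# Hypothetical arithmetic for the effective batch size via gradient accumulation.
total_batch_size = 4096                      # target effective batch size
per_gpu_batchsize, num_gpus, num_nodes = 16, 8, 1
grad_accum_steps = total_batch_size // (per_gpu_batchsize * num_gpus * num_nodes)
print(grad_accum_steps)                      # 32 accumulation steps per optimizer update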

The config setting of swin transformer

Hello, thanks for open-sourcing METER, which includes many valuable experiments.
The code uses CLIP-ViT to extract visual features. If I want to use the Swin Transformer instead, what configs should I set? Thank you again~
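
For context, the image encoder is selected by a named config in the run.py commands earlier in this README; a Swin run would look like the sketch below, where <SWIN_IMAGE_ENCODER> is only a placeholder for whatever Swin config name the repo defines (not a verified flag), paired with the 384 resolution of the released METER-SwinBase-RoBERTa checkpoint:

python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_clip_bert per_gpu_batchsize=<BS_FITS_YOUR_GPU> <SWIN_IMAGE_ENCODER> text_roberta image_size=384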

IRTR evaluation is too slow

Hi authors,
Thanks for your great work!
I am trying to reproduce the results but found that IRTR testing is very slow.
It seems to need about 38 hours for inference even on advanced hardware (like an A100); specifically, it is the ranking loop that costs so much time.
Is this normal?

Thanks!

Inference with Fine-tuned SNLI Model

Hi,

Thank you for the great work and the fine-tuned models. I just wanted to ask how I should go about running inference with a fine-tuned model. Currently, I run into this error in my notebook:

1 model = METERTransformerSS(cfg)
----> 2 model.load_state_dict(torch.load("/content/meter_clip16_288_roberta_snli.ckpt")['state_dict'])

[/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in load_state_dict(self, state_dict, strict)
   1050         if len(error_msgs) > 0:
   1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1053         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1054 

RuntimeError: Error(s) in loading state_dict for METERTransformerSS:
	Unexpected key(s) in state_dict: "vit_model.token_embedding.weight". 
	size mismatch for vit_model.visual.positional_embedding: copying a param with shape torch.Size([577, 768]) from checkpoint, the shape in current model is torch.Size([197, 768]).

I wonder if this is due to how I configured the model. Is there a specific way I should create the config for inference? Thank you in advance.
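
For reference, the size mismatch is consistent with a resolution mismatch rather than a corrupted checkpoint: with 16x16 patches, a ViT has (image_size / 16)^2 + 1 positional embeddings, so 577 corresponds to image_size=384 (the resolution of the released SNLI-VE checkpoint) while 197 corresponds to image_size=224. A minimal sketch of that check; the patch size and extra CLS position are assumptions inferred from the error above:

# Hypothetical check: positional-embedding count implied by an input resolution,
# assuming 16x16 patches plus one extra position for the [CLS] token.
def num_positions(image_size, patch_size=16):
    return (image_size // patch_size) ** 2 + 1

print(num_positions(384))  # 577 -> matches the fine-tuned SNLI checkpoint
print(num_positions(224))  # 197 -> matches the default config in the notebook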
