
henghuiding / vision-language-transformer


[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

License: MIT License

Python 100.00%
vision-language transformer referring-segmentation tensorflow keras iccv2021 vision-language-transformer tpami

vision-language-transformer's Introduction

Please consider citing our paper in your publications if the project helps your research.

@inproceedings{vision-language-transformer,
  title={Vision-Language Transformer and Query Generation for Referring Segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2021}
}

Introduction

Vision-Language Transformer (VLT) is a framework for the referring segmentation task. Our method produces multiple query vectors for one input language expression and uses each of them to "query" the input image, generating a set of responses. The network then selectively aggregates these responses, spotlighting the queries that provide better comprehension of the expression.
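To make this concrete, below is a minimal, self-contained sketch of the query-and-aggregate idea in plain NumPy. It is not the authors' implementation (the real model is built from Keras/TensorFlow layers such as lang_tf_enc quoted in the issues below); the shapes and the scoring head are illustrative assumptions only.

    # Conceptual sketch only: multiple queries attend over the image features,
    # and their responses are aggregated with softmax weights.
    # The 26x26 map, 256 channels and 16 queries follow defaults mentioned
    # elsewhere on this page; the scoring head is a placeholder.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    H, W, C, num_query = 26, 26, 256, 16
    vision_feat = np.random.randn(H * W, C)      # flattened vision features
    queries = np.random.randn(num_query, C)      # query vectors from the language expression

    # Each query "queries" the image: attention over spatial positions.
    attn = softmax(queries @ vision_feat.T / np.sqrt(C))   # (num_query, H*W)
    responses = attn @ vision_feat                          # (num_query, C)

    # Selective aggregation: queries whose responses score higher are spotlighted.
    score_head = np.random.randn(C)                         # placeholder scoring weights
    weights = softmax(responses @ score_head)               # (num_query,)
    aggregated = weights @ responses                        # (C,) fused response
    print(aggregated.shape)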

Installation

  1. Environment:

    • Python 3.6

    • tensorflow 1.15

    • Other dependencies in requirements.txt

    • SpaCy model for embedding:

      python -m spacy download en_vectors_web_lg

  2. Dataset preparation

    • Put the folder of the COCO training set ("train2014") under data/images/.

    • Download the RefCOCO datasets from here and extract them to data/. Then run the data preparation script under data/ (the resulting data/ layout is sketched below):

      cd data
      python data_process_v2.py --data_root . --output_dir data_v2 --dataset [refcoco/refcoco+/refcocog] --split [unc/umd/google] --generate_mask
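
After these steps, the data folder should look roughly like the sketch below. The exact layout is an assumption inferred from the default paths used in the example config files (e.g. image_path, train_set, seg_gt_path in the config dump quoted in the issues):

    data/
      images/
        train2014/                 COCO training images
      weights/
        yolov3_480000.h5           pretrained backbone (see Training below)
      data_v2/                     output of data_process_v2.py
        anns/
          refcoco/                 train.json, val.json, ...
        masks/
          refcoco/                 ground-truth masks (from --generate_mask)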
      

Evaluating

  1. Download pretrained models & config files from here.

  2. In the config file, set (an example config sketch is shown below):

    • evaluate_model: path to the pretrained weights.
    • evaluate_set: path to the dataset (annotation file) for evaluation.
  3. Run

    python vlt.py test [PATH_TO_CONFIG_FILE]
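
For reference, a minimal evaluation config might look like the sketch below. The key names follow the config dump quoted in the issues further down; the paths and values are placeholders for your own setup, not recommended settings.

    # hypothetical evaluation config sketch (YAML)
    evaluate_model: ./models/refcoco_vlt.h5                # pretrained weights from step 1
    evaluate_set: ./data/data_v2/anns/refcoco/val.json     # produced by data_process_v2.py
    input_size: 416
    segment_thresh: 0.35
    word_embed: en_vectors_web_lg
    word_len: 20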
    

Training

  1. Pretrained Backbones: We use the backbone weights provided by MCN.

    Note: we use the backbone that excludes all images appearing in the val/test splits of RefCOCO, RefCOCO+ and RefCOCOg.

  2. Specify the hyperparameters, dataset paths and pretrained weight path in the configuration file. Please refer to the examples under /config, the config files of our pretrained models, or the sketch shown below.

  3. Run

    python vlt.py train [PATH_TO_CONFIG_FILE]
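
As a rough guide, a training config might contain entries like the following sketch. The key names come from the config dump quoted in the issues below; the values and paths are placeholders, not recommended settings.

    # hypothetical training config sketch (YAML)
    batch_size: 128
    epoches: 50
    lr: 0.001
    lr_scheduler: step
    steps: [40, 45, 50]
    num_query: 16
    input_size: 416
    word_embed: en_vectors_web_lg
    word_len: 20
    pretrained_weights: ./data/weights/yolov3_480000.h5    # MCN backbone weights (step 1)
    image_path: ./data/images/train2014
    train_set: ./data/data_v2/anns/refcoco/train.json
    evaluate_set: ./data/data_v2/anns/refcoco/val.json
    seg_gt_path: ./data/data_v2/masks/refcoco
    log_path: ./log/refcoco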
    

Acknowledgement

We borrowed a lot of code from MCN, keras-transformer, RefCOCO API and keras-yolo3. Thanks for their excellent work!

vision-language-transformer's People

Contributors

changliu19, henghuiding


vision-language-transformer's Issues

Corresponding code for the Query Generation Module

Thanks for sharing the code. However, I'm a bit confused by the code of the QGM, as the naming in the code is slightly different from the original paper (if I understand it correctly).

I think the code for that module is defined in the function lang_tf_enc of model/transformer_model.py:

def lang_tf_enc(vision_input,
                lang_input,
                head_num=8,
                hidden_dim=256):
    decoder_embed_lang = TrigPosEmbedding(
        mode=TrigPosEmbedding.MODE_ADD,
        name='Fusion-Lang-Decoder-Embedding',
    )(lang_input)
    decoder_embed_vis = TrigPosEmbedding(
        mode=TrigPosEmbedding.MODE_ADD,
        name='Fusion-Vis-Decoder-Embedding',
    )(vision_input)
    q_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_vis)
    k_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
    v_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
    decoded_layer = MultiHeadAttention(head_num=head_num)(
        [q_inp, k_inp, v_inp])
    add_layer = L.Add(name='Fusion-Add')([decoded_layer, vision_input])

    return add_layer

As Figure 4 suggests, the input vision features should be the raw vision features extracted from the vision backbone network. Yet the input to this function is Fm_query, which is already fused from vision & language features (in the function make_multitask_braches of model/vlt_model.py):

def make_multitask_braches(Fv, fq, fq_word, config):
    # fq: bs, 1024
    # fq_word: bs, 15, 1024
    Fm = simple_fusion(Fv[0], fq, config.jemb_dim)  # 13, 13, 1024

    Fm_mid_query = up_proj_cat_proj(Fm, Fv[1], K.int_shape(Fv[1],)[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512
    Fm_query = pool_proj_cat_proj(Fm_mid_query, Fv[2], K.int_shape(Fv[2])[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512

    Fm_mid_tf = proj_cat(Fm_query, Fm_mid_query, K.int_shape(Fm)[-1]//2)  # 26, 26, 1024
    F_tf = up_proj_cat_proj(Fm, Fm_mid_tf, K.int_shape(Fm)[-1] // 2)

    F_tf = V.DarknetConv2D_BN_Leaky(config.hidden_dim, (1, 1))(F_tf)

    # Fm_query:  bs, Hm, Wm, C  (None, 26, 26, 512)
    # Fm_top_tf :  bs, Hc, Wc, C  (None, 26, 26, 512)
    query_out = vlt_querynet(Fm_query, config)
    mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
    mask_out = vlt_postproc(mask_out, Fm_query, config)

    return mask_out

Can you tell me if I got it wrong? Thanks for your great patience.

Swin Transformer as the visual feature extractor

Hello, I noticed that your TPAMI 2023 paper adds experiments using Swin Transformer as the visual feature extractor. Perhaps due to my limited coding ability, I am having some difficulty loading a Swin Transformer model in Keras. Would it be possible for you to release the code for loading Swin Transformer? If so, many thanks.

Reproducing on refcocog

Has anyone tried reproducing the code on the RefCOCOg dataset? My environment is Python 3.7 and TensorFlow 2.13. I've succeeded in fixing the errors caused by the version change, but when I train the model on RefCOCOg with the given base.yaml, my final validation seg_iou is around 28%.

Run this repository with the latest versions or the same as the author's?

Hello,

for those who have successfully run this repository, I want to ask which versions you used (e.g. Python, TensorFlow). Are they the same as the author's, or the latest?

My current virtual env uses all the latest versions: Python 3.9 and TensorFlow 2.13.0. So there were many issues, and I repeatedly modified some code and re-ran it.

But when I ran python vlt.py test ./config/refcoco/example.yaml, the following error occurred in non-project files:

    raise ValueError("%s is not compatible with %s" % types)
    ValueError: GRU(reset_after=False) is not compatible with GRU(reset_after=True)

However, the latest (2.x) TensorFlow versions change the relevant default values.

Besides, for the Evaluating part, I downloaded the files; where do I put them? I noticed there's already an existing folder called config. In Step 2, does it mean to set the OneDrive config files or the existing ones? What does the project structure look like?

Have you guys met such issues? Thank you in advance.

Some questions about an environment error and RefCOCOg's overall IoU in the paper

Hello!
Recently, I finished some experiments; in our experiments the RefCOCOg test(u) overall IoU is lower than VLT's, yet I find that VLT's test(u) overall IoU is much higher than other algorithms'. Thus, I tried to run your code. Unfortunately, the code gave an error and no workaround was found. The reproduction process is as follows:
step 1:
git clone https://github.com/henghuiding/Vision-Language-Transformer
step 2:
conda create -n vlt python=3.6
conda activate vlt
pip install -r requirements.txt
step 3:
python vlt.py test pretrain/config.yaml

Errors:
File "vlt.py", line 57, in
tester = Tester(config, GPUS=GPU_COUNTS, debug=args.debug)
File "/home/users3/workspace/Vision-Language-Transformer/executor.py", line 184, in init
super(Tester, self).init(config, **kwargs)
File "/home/users3/workspace/Vision-Language-Transformer/executor.py", line 39, in init
self.yolo_model, self.yolo_body, self.yolo_body_single = self.create_model()
File "/home/users3/workspace/Vision-Language-Transformer/executor.py", line 54, in create_model
model_body = yolo_body(image_input, q_input, self.config)
File "/home/users3/workspace/Vision-Language-Transformer/model/vlt_model.py", line 162, in yolo_body
mask_out = make_multitask_braches(Fv, fq, fq_word, config)
File "/home/users3/workspace/Vision-Language-Transformer/model/vlt_model.py", line 79, in make_multitask_braches
mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
File "/home/users3/workspace/Vision-Language-Transformer/model/vlt_model.py", line 100, in vlt_transformer
head_num=config.transformer_head_num)
File "/home/users3/workspace/Vision-Language-Transformer/model/transfromer_model.py", line 79, in lang_tf_enc
)(lang_input)
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 881, in call
inputs, outputs, args, kwargs)
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2043, in set_connectivity_metadata
input_tensors=inputs, output_tensors=outputs, arguments=arguments)
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2059, in _add_inbound_node
input_tensors)
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in map_structure
structure[0], [func(*x) for x in entries],
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in
structure[0], [func(*x) for x in entries],
File "/home/users3/anaconda3/envs/vlt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2058, in
inbound_layers = nest.map_structure(lambda t: t._keras_history.layer,
AttributeError: 'tuple' object has no attribute 'layer'

If it is convenient, I would be grateful if you could provide the logs from running and testing your code on RefCOCOg, or a more detailed environment configuration. Thanks.

About training speed.

Thanks for the great work! I have a question about the training speed: I tried to train the model on 2 V100 GPUs with batch size 256, but the training process is very slow, about 10 hours per epoch. Is this normal?

question about loader.py

In loader.py, the height and width of the masks are half the height and width of the images. May I ask why?

Problem: AttributeError: 'tuple' object has no attribute 'layer'

Whenever I try to train, test, or do anything else, I always get the same error. I have the code set up as you explain, but I keep getting it:

(ultimate) @fio:Vision-Language-Transformer> python vlt.py train config.yaml
Using TensorFlow backend.
batch_size: 128
embed_dim: 300
epoches: 50
evaluate_model: ./models/test_map.h5
evaluate_set: ./data/data_v2/anns/refcocog/val.json
free_body: 1
hidden_dim: 256
image_path: ./data/images/train2014
input_size: 416
jemb_dim: 1024
lang_att: True
log_images: 0
log_path: ./log/refcocog
lr: 0.001
lr_scheduler: step
max_queue_size: 10
multi_thres: False
num_query: 16
pretrained_weights: ./data/weights/yolov3_480000.h5
query_balance: True
rnn_bidirectional: True
rnn_drop_out: 0.1
rnn_hidden_size: 1024
seed: 10010
seg_gt_path: ./data/data_v2/masks/refcocog
seg_out_stride: 2
segment_thresh: 0.35
start_epoch: 0
steps: [40, 45, 50]
train_set: ./data/data_v2/anns/refcocog/train.json
transformer_decoder_num: 2
transformer_encoder_num: 2
transformer_head_num: 8
transformer_hidden_dim: 256
word_embed: en_vectors_web_lg
word_len: 20
workers: 32


--------------------------
PHASE:train

1 GPUs detected:
['/device:GPU:0']
Dataset Loaded: evaluate_set,  Len: 5000
Dataset Loaded: train_set,  Len: 44822
Creating model...
Traceback (most recent call last):
  File "vlt.py", line 54, in <module>
    trainer = Trainer(config, log_path, GPUS=GPU_COUNTS, debug=args.debug, verbose=args.verbose)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 117, in __init__
    super(Trainer, self).__init__(config, **kwargs)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 39, in __init__
    self.yolo_model, self.yolo_body, self.yolo_body_single = self.create_model()
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 54, in create_model
    model_body = yolo_body(image_input, q_input, self.config)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 162, in yolo_body
    mask_out = make_multitask_braches(Fv, fq, fq_word, config)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 79, in make_multitask_braches
    mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 100, in vlt_transformer
    head_num=config.transformer_head_num)
  File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/transfromer_model.py", line 78, in lang_tf_enc
    )(lang_input)
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 881, in __call__
    inputs, outputs, args, kwargs)
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2043, in _set_connectivity_metadata_
    input_tensors=inputs, output_tensors=outputs, arguments=arguments)
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2059, in _add_inbound_node
    input_tensors)
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2058, in <lambda>
    inbound_layers = nest.map_structure(lambda t: t._keras_history.layer,
AttributeError: 'tuple' object has no attribute 'layer'

I have installed all the versions you specify in the requirements, and I am able to run the data_process_v2 script without any problem.
Do you know what is happening?

Thank you in advance,

Ferriol

Two questions about the model training: weight mismatch and yolov3_480000.h5

Hi,
I am running your code for model training, but I have two questions:

  1. How do I generate the weights file yolov3_480000.h5? I tried to generate yolov3_480000.h5 following the Keras implementation of YOLOv3, converting the Darknet YOLO model to a Keras model with python convert.py yolov3.cfg yolov3.weights model_data/yolo.h5. Is this right?
  2. After generating yolov3_480000.h5, I started to train the model, but it shows some warning information. Why?

Confusion about data_process_v2

Hello, I just checked the file 'data/data_process_v2.py', and I found something confusing.

Since line 98 checks if dataset == 'refclef', you apparently take the RefClef dataset into account, not only RefCOCO, RefCOCO+ and RefCOCOg. But should the categories in RefClef be processed the same way as the RefCOCO* ones, as in the cat_process function? I guess the cat_process function converts the 91 COCO categories to the 80-category set. I wonder whether this applies to RefClef as well?

By the way, still in line 98, why should ['19579.jpg', '17975.jpg', '19575.jpg'] be excluded? Is there any explanation?

Your reply would be highly appreciated, thanks :)

Test with own dataset

Thank you for your great work!

I am curious about testing with my own dataset.
I just want to get a mask image from my data.

Thank you very much.
