
MmNas - Deep Multimodal Neural Architecture Search

This repository is the PyTorch implementation of MmNas for {Visual Question Answering, Visual Grounding, Image-Text Matching}.

Prerequisites

Software and Hardware Requirements

You may need a machine with at least 4 GPUs (>= 8GB each), 50GB of RAM for VQA and VGD (150GB for ITM), and 50GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.

You should first install some necessary packages.

  1. Install Python >= 3.6

  2. Install Cuda >= 9.0 and cuDNN

  3. Install PyTorch >= 0.4.1 with CUDA (PyTorch 1.x is also supported).

  4. Install SpaCy and initialize the GloVe vectors as follows:

    $ pip install -r requirements.txt
    $ wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
    $ pip install en_vectors_web_lg-2.1.0.tar.gz
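
To check the installation, the minimal sketch below loads the installed GloVe vectors with spaCy and embeds the tokens of a question; the example sentence is arbitrary, and 300-D is the standard vector size of en_vectors_web_lg.

    import spacy

    # Load the GloVe vectors installed above (en_vectors_web_lg provides 300-D vectors).
    nlp = spacy.load('en_vectors_web_lg')

    doc = nlp('what color is the cat')
    # Each token exposes its GloVe embedding via token.vector.
    embeddings = [token.vector for token in doc]
    print(len(embeddings), embeddings[0].shape)  # 5 (300,)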

Setup for VQA

The image features are extracted using the bottom-up-attention strategy, with each image being represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features by yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images of VQA-v2, respectively. You should place them as follows:

|-- data
	|-- coco_extract
	|  |-- train2014.tar.gz
	|  |-- val2014.tar.gz
	|  |-- test2015.tar.gz

In addition, we use the VQA samples from the Visual Genome dataset to expand the training set. Similar to existing strategies, we preprocessed the samples with two rules (sketched below):

  1. Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
  2. Select the QA pairs whose answers appear in the processed answer list (i.e., answers occurring more than 8 times among all VQA-v2 answers).
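
Below is a minimal sketch of these two rules; the field names ('image_id', 'answer') are illustrative and may differ from the schema of the raw Visual Genome dumps.

    def filter_vg_pairs(vg_pairs, coco_trainval_image_ids, answer_list):
        # Rule 1: keep pairs whose image appears in the MSCOCO train/val splits.
        # Rule 2: keep pairs whose answer appears in the processed answer list
        #         (answers occurring more than 8 times among VQA-v2 answers).
        # The field names 'image_id' and 'answer' are illustrative only.
        answer_set = set(answer_list)
        return [
            pair for pair in vg_pairs
            if pair['image_id'] in coco_trainval_image_ids
            and pair['answer'] in answer_set
        ]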

For convenience, we provide our processed VG question and annotation files; you can download them from OneDrive or BaiduYun and place them as follows:

|-- datasets
	|-- vqa
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

After that, you should:

  1. Download the QA files for VQA-v2.
  2. Unzip the bottom-up features (see the extraction sketch below).
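
For step 2, the sketch below extracts the archives with Python's tarfile module; it assumes each archive unpacks into its own train2014/, val2014/, or test2015/ directory, matching the final layout listed next. You can equally use tar on the command line.

    import tarfile

    # Extract the bottom-up features into data/coco_extract/.
    # Assumes each archive unpacks into its own train2014/, val2014/ or test2015/ folder.
    for split in ('train2014', 'val2014', 'test2015'):
        with tarfile.open(f'data/coco_extract/{split}.tar.gz') as tar:
            tar.extractall('data/coco_extract/')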

Finally, the data folders will have the following structure:

|-- data
	|-- coco_extract
	|  |-- train2014
	|  |  |-- COCO_train2014_...jpg.npz
	|  |  |-- ...
	|  |-- val2014
	|  |  |-- COCO_val2014_...jpg.npz
	|  |  |-- ...
	|  |-- test2015
	|  |  |-- COCO_test2015_...jpg.npz
	|  |  |-- ...
	|-- vqa
	|  |-- v2_OpenEnded_mscoco_train2014_questions.json
	|  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |-- v2_OpenEnded_mscoco_test2015_questions.json
	|  |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
	|  |-- v2_mscoco_train2014_annotations.json
	|  |-- v2_mscoco_val2014_annotations.json
	|  |-- VG_questions.json
	|  |-- VG_annotations.json
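
Once the features are unpacked, a quick way to sanity-check them is to load a single .npz file and zero-pad the region features to a fixed number of boxes, as sketched below. The key name 'x' and the (num_boxes, 2048) layout are assumptions; run np.load(path).files to confirm the actual field names in your download.

    import numpy as np

    def load_region_feats(npz_path, max_boxes=100, key='x'):
        # The key name 'x' and the (num_boxes, 2048) layout are assumptions;
        # check np.load(npz_path).files for the actual field names.
        feats = np.load(npz_path)[key]
        n = min(feats.shape[0], max_boxes)
        padded = np.zeros((max_boxes, feats.shape[1]), dtype=np.float32)
        padded[:n] = feats[:n]
        mask = np.zeros(max_boxes, dtype=bool)  # True for real regions, False for padding
        mask[:n] = True
        return padded, mask

    # e.g. feats, mask = load_region_feats('data/coco_extract/train2014/COCO_train2014_...jpg.npz')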

Setup for Visual Grounding

The image features are extracted using the bottom-up-attention strategy, with two types of features being used: 1. a Faster R-CNN detector pre-trained on Visual Genome (excluding the reference images); 2. a Mask R-CNN detector pre-trained on COCO, following MAttNet. We store the features for each image in a .npz file. You can prepare the visual features by yourself or download the extracted features from OneDrive and place them in the ./data folder.

The Refs datasets {refcoco, refcoco+, refcocog} were introduced here; build and place them as follows:

|-- data
	|-- vgd_coco
	|  |-- fix100
	|  |  |-- refcoco_unc
	|  |  |-- refcoco+_unc
	|  |  |-- refcocog_umd
	|-- detfeat100_woref
	|-- refs
	|  |-- refcoco
	|  |   |-- instances.json
	|  |   |-- refs(google).p
	|  |   |-- refs(unc).p
	|  |-- refcoco+
	|  |   |-- instances.json
	|  |   |-- refs(unc).p
	|  |-- refcocog
	|  |   |-- instances.json
	|  |   |-- refs(google).p

Additionally, you also need to build the extension as follows:

cd mmnas/utils
python3 setup.py build
cp build/[lib.*/*.so] .
cd ../..

Setup for Image-Text Matching

The image features are extracted using the bottom-up-attention strategy, with each image being represented as a fixed number (36) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features by yourself or download the extracted features from OneDrive and place them in the ./data folder.

The retrieval datasets {flickr, coco} can be found here; extract and place them as follows:

|-- data
	|-- rois_resnet101_fix36
	|  |-- train2014
	|  |-- val2014
	|-- flickr_rois_resnet101_fix36
	|-- itm
	|  |-- coco_precomp
	|  |-- f30k_precomp

Training

The following script will start training with the default hyperparameters:

  1. VQA
$ python3 train_vqa.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vqa.json'
  2. VGD
$ python3 train_vgd.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_vgd.json'
  3. ITM
$ python3 train_itm.py --RUN='train' --GENO_PATH='./logs/ckpts/arch/train_itm.json'

To add:

  1. --VERSION=str, e.g. --VERSION='small_model', to assign a name to this model.

  2. --GPU=str, e.g. --GPU='0, 1, 2, 3', to train the model on the specified GPU devices.

  3. --NW=int, e.g. --NW=8, to accelerate I/O speed.

  4. --MODEL={'small', 'large'}. Warning: the large model consumes more GPU memory; multi-GPU training and gradient accumulation can help if you want to train it with limited GPU memory (a gradient-accumulation sketch follows this list).

  5. --SPLIT={'train', 'train+val', 'train+val+vg'} can combine the training datasets as you want. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to run the validation score after every epoch automatically.

  6. --RESUME to start training with saved checkpoint parameters.

  7. --GENO_PATH=str to train with a different searched architecture.
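
For item 4, gradient accumulation trades extra optimizer steps for lower per-step memory while keeping the effective batch size. The sketch below is a generic PyTorch pattern, not the repository's own training loop; model, loader, optimizer, and criterion are placeholders.

    ACCUM_STEPS = 4  # effective batch size = loader batch size * ACCUM_STEPS

    def train_one_epoch(model, loader, optimizer, criterion, device='cuda'):
        # Generic gradient-accumulation loop; not the repository's own trainer.
        model.train()
        optimizer.zero_grad()
        for step, (inputs, targets) in enumerate(loader):
            inputs, targets = inputs.to(device), targets.to(device)
            loss = criterion(model(inputs), targets) / ACCUM_STEPS  # scale so accumulated grads average
            loss.backward()                                         # gradients accumulate across steps
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()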

Validation and Testing

Warning: if you trained the model with the --MODEL argument or with multi-GPU training, the same settings must also be used for evaluation.

Offline Evaluation

An easy way is to set the following arguments: --RUN={'val', 'test'} and --CKPT_PATH=[Your Model Path] to run the val or test split.

Example:

$ python3 train_vqa.py --RUN='test' --CKPT_PATH=[Your Model Path] --GENO_PATH=[Searched Architecture Path]

Online Evaluation (ONLY FOR VQA)

The test result file will be stored at ./logs/ckpts/result_test/result_run_[Your Version].json.

You can upload the obtained result JSON file to Eval AI to evaluate the scores on the test-dev and test-std splits.
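
For reference, the result file should follow the standard VQA-v2 submission format accepted by Eval AI, i.e. a JSON list of question_id/answer entries; the snippet below only illustrates that format with placeholder values, using the 'small_model' version name from the training example above.

    import json

    # Standard VQA-v2 submission format: a list of {"question_id": int, "answer": str} entries.
    # The values below are placeholders for illustration only.
    results = [
        {"question_id": 123456000, "answer": "yes"},
        {"question_id": 123456001, "answer": "2"},
    ]
    with open('./logs/ckpts/result_test/result_run_small_model.json', 'w') as f:
        json.dump(results, f)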

Citation

If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:

@article{yu2020deep,
  title={Deep Multimodal Neural Architecture Search},
  author={Yu, Zhou and Cui, Yuhao and Yu, Jun and Wang, Meng and Tao, Dacheng and Tian, Qi},
  journal={arXiv preprint arXiv:2004.12070},
  year={2020}
}
