Adversarial Inference for Multi-Sentence Video Descriptions

This is the implementation of the CVPR 2019 paper Adversarial Inference for Multi-Sentence Video Descriptions.

This repository is based on self-critical.pytorch. Thank you Ruotian for the code! The modifications are:

  • Training Multimodal Generator and Hybrid Discriminator in models/.
  • Adversarial Inference in eval_utils.py

Requirements

Clone the repository recursively:

git clone --recursive https://github.com/jamespark3922/adv-inf

  • Python 2.7 (because there is no coco-caption version for Python 3)
  • PyTorch 0.4 (along with torchvision)
  • densevid_eval (for ActivityNet evaluation)
  • Java, to run the meteor.jar file

Training on ActivityNet Dense Captions

Download the ActivityNet captions and preprocess them.

We share the input labels and features in this folder. (Scripts to preprocess the labels will be available soon.)

Features

  • resnext101-64f (126GB), extracted with the r3d repository
  • resnet152 (14GB), extracted 100 frames for each video
  • bottomup labels (16GB) with confidence score, extracted 3 frames for each clip

After downloading them all, unzip them to your preferred feature directory.

Note that mean pooling is applied when the data is loaded in dataloader.py.
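As a rough illustration of what mean pooling over frame features means, here is a minimal sketch in plain Python. The function name `mean_pool` is hypothetical and is not taken from dataloader.py, which may pool over different spans (for example, per clip) and operates on arrays rather than lists.

```python
def mean_pool(frames):
    """Average a list of per-frame feature vectors into a single vector.

    `frames` is a list of equal-length lists of floats, one per frame.
    This is an illustrative helper, not the code used in dataloader.py.
    """
    if not frames:
        raise ValueError("need at least one frame to pool")
    dim = len(frames[0])
    n = float(len(frames))
    # Average each feature dimension across all frames.
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


# Two frames with 2-dimensional features collapse into one 2-dimensional vector.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```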

Training

python train.py --caption_model video --input_json activity_net/inputs/video_data_dense.json --input_fc_dir activity_net/feats/resnext101-64f/ --input_img_dir activity_net/feats/resnet152/ --input_box_dir activity_net/feats/bottomup/ --input_label_h5 activity_net/inputs/video_data_dense_label.h5 --glove_npy activity_net/inputs/glove.npy --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path video_ckpt --val_videos_use -1 --losses_print_every 10 --batch_size 16 --language_eval 1

Context: The generator model uses the hidden state of the previous sentence as "context", starting at the epoch set by --g_context_epoch.

Evaluation

After training is done, evaluate the captions at the paragraph level. Note that the evaluation is done on the val1 set.

Normal inference using greedy decoding or beam search can be run with the following command:

python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 1 --id $id --beam_size $beam_size

The generated captions are saved in densevid_eval/caption_$id.json. You can also omit --d_model_path if you do not wish to score and evaluate with the discriminator.

Adversarial Inference

To sample $num_samples sentences and choose the best one with the discriminator, run:

python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 0 --num_samples $num_samples --temperature $temperature --id $id
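The sample-and-select idea behind this command can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `sample_with_temperature`, `adversarial_inference`, `sample_fn`, and `score_fn` are hypothetical names, and the actual logic in eval_utils.py works with the trained generator and discriminator and may differ in detail.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one index from softmax(logits / temperature).

    Lower temperatures concentrate probability on the highest logit;
    higher temperatures make the draw closer to uniform.
    """
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    threshold = rng.random() * sum(weights)
    acc = 0.0
    for index, weight in enumerate(weights):
        acc += weight
        if threshold < acc:
            return index
    return len(weights) - 1


def adversarial_inference(sample_fn, score_fn, num_samples):
    """Sample `num_samples` candidates and keep the best-scoring one.

    `sample_fn` stands in for the generator (returns one candidate sentence);
    `score_fn` stands in for the discriminator (higher score is better).
    """
    candidates = [sample_fn() for _ in range(num_samples)]
    return max(candidates, key=score_fn)


# Toy usage: a fake "generator" that yields fixed sentences and a fake
# "discriminator" that simply prefers longer sentences.
fake_sentences = iter(["a man", "a man is cooking in a kitchen", "a man cooks"])
best = adversarial_inference(lambda: next(fake_sentences), len, num_samples=3)
```

In the real pipeline the candidates would come from sampling the generator token by token at the chosen temperature, and the score would come from the discriminator given the video and the candidate sentence.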

Generated Captions

You can run the language metrics to reproduce the results:

python para-evaluate.py -s $submission_file --verbose

and the diversity metrics (Div-N, Re-N) from the paper:

python evaluateCaptionsDiversity.py $submission_file
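For intuition about what such diversity metrics measure, the sketch below uses one plausible reading: Div-N as the number of unique n-grams divided by the number of words in a paragraph, and Re-N as the fraction of n-gram occurrences that repeat an earlier n-gram. These definitions and the function names are assumptions for illustration; the exact computation in evaluateCaptionsDiversity.py may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def div_n(tokens, n):
    """Assumed Div-N: unique n-grams divided by the total number of words."""
    if not tokens:
        return 0.0
    return len(set(ngrams(tokens, n))) / float(len(tokens))


def re_n(tokens, n):
    """Assumed Re-N: share of n-gram occurrences that repeat an earlier one."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / float(len(grams))


# A paragraph that repeats "a man is" scores lower on diversity
# and higher on repetition than one with no repeated phrases.
paragraph = "a man is cooking a man is eating".split()
div_1 = div_n(paragraph, 1)
re_2 = re_n(paragraph, 2)
```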

Reference

@article{park2019advinf,
  title={Adversarial Inference for Multi-Sentence Video Descriptions},
  author={Park, Jae Sung and Rohrbach, Marcus and Darrell, Trevor and Rohrbach, Anna},
  journal={CVPR 2019},
  year={2019}
}
