
listen-to-look's Introduction

Listen to Look: Action Recognition by Previewing Audio (CVPR 2020)

[Project Page] [arXiv]


Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao1,2, Tae-Hyun Oh2, Kristen Grauman1,2, Lorenzo Torresani2
1UT Austin, 2Facebook AI Research
In Conference on Computer Vision and Pattern Recognition (CVPR), 2020


If you find our code or project useful in your research, please cite:

@inproceedings{gao2020listentolook,
  title = {Listen to Look: Action Recognition by Previewing Audio},
  author = {Gao, Ruohan and Oh, Tae-Hyun and Grauman, Kristen and Torresani, Lorenzo},
  booktitle = {CVPR},
  year = {2020}
}

Preparation

The image features, audio features, and image-audio features for ActivityNet are shared at this link. After IMGAUD2VID distillation on Kinetics, we fine-tune the image-audio network for action classification on ActivityNet. The image features, audio features, and image-audio features after the fusion layer (see Fig. 2 in the paper) are extracted from the fine-tuned image-audio network. The image-audio model fine-tuned on ActivityNet and the pickle files with the paths to the image-audio features are also shared. The features can also be downloaded using the commands below:

wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/image_features.tar.gz
wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/audio_features.tar.gz
wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/imageAudio_features.tar.gz
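
The archives can be unpacked anywhere; just keep track of the root directory, since the pickle files reference the feature paths. A minimal extraction sketch (the destination directory below is an assumption, not a path from the release):

import tarfile
from pathlib import Path

# Assumed destination; use whatever root you plan to reference in the pickle files.
dest = Path('./listen_to_look_features')
dest.mkdir(parents=True, exist_ok=True)

for name in ['image_features.tar.gz', 'audio_features.tar.gz', 'imageAudio_features.tar.gz']:
    with tarfile.open(name, 'r:gz') as tar:
        tar.extractall(dest)  # preserves the archive's internal layout
    print('extracted', name, '->', dest)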

Training and Testing

(The code has been tested under the following system environment: Ubuntu 18.04.3 LTS, CUDA 10.0, Python 3.7.3, PyTorch 1.0.1)

  1. Download the extracted features and the fine-tuned image-audio model for ActivityNet, and prepare the pickle files accordingly by changing the paths to use your own root prefix (see the sketch below).
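
A minimal sketch of the path rewrite, assuming the pickle stores the feature paths as strings inside lists/dicts (the old and new prefixes below are placeholders; load the file and inspect its structure before applying a blind replacement):

import pickle

OLD_ROOT = '/original_root_path'      # placeholder: prefix found in the released pickle
NEW_ROOT = '/your_feature_root_path'  # placeholder: your local feature directory

def rewrite(obj):
    # Recursively replace the prefix in every string found in the pickled object.
    if isinstance(obj, str):
        return obj.replace(OLD_ROOT, NEW_ROOT)
    if isinstance(obj, list):
        return [rewrite(x) for x in obj]
    if isinstance(obj, tuple):
        return tuple(rewrite(x) for x in obj)
    if isinstance(obj, dict):
        return {k: rewrite(v) for k, v in obj.items()}
    return obj

with open('train.pkl', 'rb') as f:
    data = pickle.load(f)
with open('train_fixed.pkl', 'wb') as f:  # write to a new file to keep the original intact
    pickle.dump(rewrite(data), f)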

  2. Use the following command to train the video preview model:

python main.py \
--train_dataset_file '/your_pickle_file_root_path/train.pkl' \
--test_dataset_file '/your_pickle_file_root_path/val.pkl' \
--batch_size 256 \
--warmup_epochs 0 \
--epochs 25 \
--lr 0.01 \
--milestones 15 20 \
--momentum 0.9 \
--decode_threads 10 \
--scheduler \
--num_classes 200 \
--weights_audioImageModel '/your_model_root_path/ImageAudioNet_ActivityNet.pth' \
--checkpoint_freq 10 \
--episode_length 10 \
--checkpoint_path './checkpoints/exp' \
--freeze_imageAudioNet \
--with_avgpool_ce_loss \
--compute_mAP \
--mean_feature_as_start \
--subsample_factor 1 \
--with_replacement |& tee -a logs/exp.log

  3. Use the following command to test your trained model:

python validate.py \
--test_dataset_file '/your_pickle_file_root_path/val.pkl' \
--batch_size 256 \
--decode_threads 10 \
--scheduler \
--num_classes 200 \
--episode_length 10 \
--pretrained_model './checkpoints/exp/model_final.pth' \
--with_replacement \
--mean_feature_as_start \
--feature_interpolate \
--subsample_factor 1 \
--compute_mAP

  4. The single-modality variant of our model is shared under listen_to_look_single_modality. The r2plus1d152 features for ActivityNet can be downloaded using the command below:

wget http://dl.fbaipublicfiles.com/rhgao/ListenToLook/r2plus1d152_features.tar.gz

Acknowledgements

Portions of the code are borrowed or adapted from Bruno Korbar and Zuxuan Wu. Thanks for their help!

License

The code for Listen to Look is CC BY 4.0 licensed, as found in the LICENSE file.

listen-to-look's Issues

Mini-Sports1M feature

Dear authors:
How can I get the Mini-Sports1M features in the same form as the ActivityNet features you have provided?
I can't find the feature extractor or the fusion-layer weights for Mini-Sports1M. Please let me know.
Thanks!
ctx

Both streams generate an output of 256 dimensions and thus the concatenated representations yield an image-audio embedding of 512 dimensions.

In /models/imageAudio_model.py:

class ImageAudioModel(torch.nn.Module):
    def name(self):
        return 'ImageAudioModel'

    def __init__(self):
        super(ImageAudioModel, self).__init__()
        # initialize model
        self.imageAudio_fc1 = torch.nn.Linear(512 * 2, 512 * 2)
        self.imageAudio_fc1.apply(networks.weights_init)
        self.imageAudio_fc2 = torch.nn.Linear(512 * 2, 512)
        self.imageAudio_fc2.apply(networks.weights_init)

The paper says both streams output 256-dimensional vectors, which are concatenated into a 512-dimensional embedding, but here the code uses 512 * 2. Does that mean each stream actually feeds in a 512-dimensional vector?

RuntimeError: the derivative for 'indices' is not implemented

python3 main.py --train_dataset_file /scratch/supreeth/listen_to_look/listen_to_look/train.pkl --test_dataset_file /scratch/supreeth/listen_to_look/listen_to_look/val.pkl --batch_size 32 --warmup_epochs 0 --epochs 25 --lr 0.01 --milestones 15 20 --momentum 0.9 --decode_threads 4 --scheduler --num_classes 200 --weights_audioImageModel /scratch/supreeth/listen_to_look/listen_to_look/ImageAudioNet_ActivityNet.pth --checkpoint_freq 10 --episode_length 10 --checkpoint_path /scratch/supreeth/listen_to_look/checkpoints/exp --freeze_imageAudioNet --with_avgpool_ce_loss --compute_mAP --mean_feature_as_start --subsample_factor 1 --with_replacement |& tee -a /scratch/supreeth/listen_to_look/logs/exp.log
2022-06-26 03:27:13,267 main: 43: DEBUG:: Namespace(avgpool_ce_loss_weight=1, batch_size=32, checkpoint_freq=10, checkpoint_path='/scratch/supreeth/listen_to_look/checkpoints/exp', compute_mAP=True, decode_threads=4, episode_length=10, epochs=25, feature_interpolate=False, feature_subsample=False, freeze_imageAudioNet=True, gt_feature_eval=False, hidden_size=512, lr=0.01, lstm_ce_loss_weight=1, mean_feature_as_start=True, milestones=[15, 20], momentum=0.9, num_classes=200, pretrained_model=None, print_freq=20, scheduler=True, start_epoch=0, subsample_factor=1, test_dataset_file='/scratch/supreeth/listen_to_look/listen_to_look/val.pkl', train_dataset_file='/scratch/supreeth/listen_to_look/listen_to_look/train.pkl', visualization=False, warmup_epochs=0, weight_decay=0.0001, weights_audioImageModel='/scratch/supreeth/listen_to_look/listen_to_look/ImageAudioNet_ActivityNet.pth', with_avgpool_ce_loss=True, with_lstm_ce_loss=False, with_replacement=True)
epoch,loss,acc,lr
Loading the weights for imageAudioClassifier network
2022-06-26 03:27:18,624 data: 30: DEBUG:: Time to load paths: 0.006342887878417969
2022-06-26 03:27:18,624 data: 31: DEBUG:: Number of videos: 9511
2022-06-26 03:27:18,627 data: 30: DEBUG:: Time to load paths: 0.0027451515197753906
2022-06-26 03:27:18,627 data: 31: DEBUG:: Number of videos: 4659
2022-06-26 03:27:18,629 main: 131: DEBUG:: DataParallel(
  (module): AudioPreviewModel(
    (net_imageAudio): ImageAudioModel(
      (imageAudio_fc1): Linear(in_features=1024, out_features=1024, bias=True)
      (imageAudio_fc2): Linear(in_features=1024, out_features=512, bias=True)
    )
    (net_classifier): ClassifierNet(
      (classifier): Linear(in_features=512, out_features=200, bias=True)
    )
    (image_queryfeature_mlp): Sequential(
      (0): Linear(in_features=1024, out_features=1024, bias=True)
      (1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Linear(in_features=1024, out_features=512, bias=True)
    )
    (audio_queryfeature_mlp): Sequential(
      (0): Linear(in_features=1024, out_features=1024, bias=True)
      (1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Linear(in_features=1024, out_features=512, bias=True)
    )
    (prediction_fc): Linear(in_features=1024, out_features=200, bias=True)
    (image_key_conv1x1): Sequential(
      (0): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (audio_key_conv1x1): Sequential(
      (0): Conv2d(512, 512, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (rnn): LSTMCell(512, 1024)
    (modality_attention): Linear(in_features=1024, out_features=2, bias=True)
    (softmax): Softmax(dim=-1)
  )
)
2022-06-26 03:27:18,629 main: 141: DEBUG:: Entering the training loop
Traceback (most recent call last):
  File "main.py", line 158, in <module>
    main(args)
  File "main.py", line 148, in main
    logger, epoch_logger=epoch_log, checkpointer=checkpointer, writer=writer)
  File "/home/supreeth_s_karan/Listen-to-Look/train.py", line 39, in train_epoch
    predictions, selected_imageAudioFeatures, selected_step_predictions = model.forward(image_features, audio_features, feature_masks, args.episode_length)
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/supreeth_s_karan/supreeth_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/supreeth_s_karan/Listen-to-Look/models/audioPreview_model.py", line 159, in forward
    helper_tensor[torch.LongTensor(range(self.batch_size)), image_feature_indexes_next] = 0 #change to 0 of indexed positions
RuntimeError: the derivative for 'indices' is not implemented

Train or val split, and pretrained models?

Dear authors:
From the features you provided, it is hard to tell which belong to the training set and which to the validation set. Could you share the dataset (audio + visual) or the features with a clear train/val split?
Also, I can't find the pretrained model.
Thanks!

About Parameters N.

Since untrimmed videos have arbitrary lengths, I guess the value of N is not fixed. In your supplementary materials you say that 16 frames are used to extract each image vector, so I guess N for each untrimmed video can be computed by dividing the number of frames by 16. T, on the other hand, is fixed: you use T = 10.

Zj in the second model

Dear authors, does Zj in the untrimmed-video setting also refer to the starting frame of one video clip Vj? In this setting, I divide the untrimmed video sample into N video clips.

About the ~z

Dear authors, I want to ask whether ~z affects the values of z = {z1, z2, z3, ..., zn}.

About the parameter N.

Hi. I want to ask about testing on untrimmed videos. In your paper you define z = {z1, z2, z3, z4, ..., zn}, and instead of using all N time steps you select T time steps for the final classification of one video sample. Am I right that at every step you search the same set z = {z1, z2, z3, z4, ..., zn} for the most useful indexed feature zj, with j in [0, N], iteratively selecting T times from the same set, and then aggregate the features from the T time steps to classify the action?

How to test on the UCF-101 dataset?

Thank you for your great work!

How can I get the image features, audio features, and image-audio features for the UCF-101 dataset?

Thank you very much!
