
VGGSound

Code and results for the ICASSP 2020 paper "VGGSound: A Large-scale Audio-Visual Dataset".

The repo contains the dataset file and our best audio classification model.

Dataset

To download VGGSound, we provide a csv file. For each YouTube video, we provide the YouTube URL, time stamps, the audio label, and the train/test split. Each line in the csv file has the following columns:

# YouTube ID, start seconds, label, train/test split

A helpful link for data download!
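
As a sketch, each line of the csv can be parsed with Python's `csv` module, assuming the four-column format above with no header row (the IDs below are fabricated for illustration):

```python
import csv
import io

# Each line: YouTube ID, start seconds, label, train/test split.
sample = io.StringIO(
    "--abcdefghi,30,playing piano,train\n"
    "--jklmnopqr,120,dog barking,test\n"
)
rows = [
    {"ytid": ytid, "start": int(start), "label": label, "split": split}
    for ytid, start, label, split in csv.reader(sample)
]
train = [r for r in rows if r["split"] == "train"]
```

The YouTube URL for a row can be reconstructed as `https://www.youtube.com/watch?v=<YouTube ID>`.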

Audio classification

We detail the audio classification results here.

  • Pretrain indicates whether the model was pretrained on the YouTube-8M dataset.
  • Dataset (common) means a subset of the dataset containing only the classes (listed here) shared between AudioSet and VGGSound.
  • ASTest is the intersection of the AudioSet and VGGSound test sets.
|   | Model | Aggregation | Pretrain | Finetune/Train | Test | mAP | AUC | d-prime |
|---|-------|-------------|----------|----------------|------|-----|-----|---------|
| A | VGGish | \ | ✔️ | AudioSet (common) | ASTest | 0.286 | 0.899 | 1.803 |
| B | VGGish | \ | ✔️ | VGGSound (common) | ASTest | 0.326 | 0.916 | 1.950 |
| C | VGGish | \ |  | VGGSound (common) | ASTest | 0.301 | 0.910 | 1.900 |
| D | ResNet18 | AveragePool |  | VGGSound (common) | ASTest | 0.328 | 0.923 | 2.024 |
| E | ResNet18 | NetVLAD |  | VGGSound (common) | ASTest | 0.369 | 0.927 | 2.058 |
| F | ResNet18 | AveragePool |  | VGGSound | ASTest | 0.404 | 0.944 | 2.253 |
| G | ResNet18 | NetVLAD |  | VGGSound | ASTest | 0.434 | 0.950 | 2.327 |
| H | ResNet18 | AveragePool |  | VGGSound | VGGSound | 0.516 | 0.968 | 2.627 |
| I | ResNet18 | NetVLAD |  | VGGSound | VGGSound | 0.512 | 0.970 | 2.660 |
| J | ResNet34 | AveragePool |  | VGGSound | ASTest | 0.409 | 0.947 | 2.292 |
| K | ResNet34 | AveragePool |  | VGGSound | VGGSound | 0.529 | 0.972 | 2.703 |
| L | ResNet50 | AveragePool |  | VGGSound | ASTest | 0.412 | 0.949 | 2.309 |
| M | ResNet50 | AveragePool |  | VGGSound | VGGSound | 0.532 | 0.973 | 2.735 |
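
For reference, the d-prime column is the standard transform of AUC used in audio tagging, d' = √2 · Φ⁻¹(AUC), where Φ⁻¹ is the inverse standard-normal CDF. A minimal stdlib sketch, which reproduces the table's values up to rounding:

```python
from math import sqrt
from statistics import NormalDist

def d_prime(auc: float) -> float:
    """d' = sqrt(2) * inverse CDF of the standard normal, evaluated at the AUC."""
    return sqrt(2) * NormalDist().inv_cdf(auc)

# e.g. model G above reports AUC 0.950 alongside d-prime 2.327
g = round(d_prime(0.950), 3)
```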

Environment

  • Python 3.6.8
  • PyTorch 1.3.0

Pretrained model and evaluation

We provide the pretrained models H and I here:

wget http://www.robots.ox.ac.uk/~vgg/data/vggsound/models/H.pth.tar
wget http://www.robots.ox.ac.uk/~vgg/data/vggsound/models/I.pth.tar

To test the model and generate prediction files,

python test.py --data_path "directory to audios/" --result_path "directory to predictions/" --summaries "path to pretrained models" --pool "avgpool"

To evaluate the model performance using the generated prediction files,

python eval.py --result_path "directory to predictions/"

Citation

@InProceedings{Chen20,
  author       = "Honglie Chen and Weidi Xie and Andrea Vedaldi and Andrew Zisserman",
  title        = "VGGSound: A Large-scale Audio-Visual Dataset",
  booktitle    = "International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
  year         = "2020",
}

License

The VGG-Sound dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.

vggsound's People

Contributors

hche11, weidixie


vggsound's Issues

Training settings

Thank you for sharing a wonderful dataset.
Could you explain about the training settings, such as learning rate, weight decay, scheduler, etc.?
It is kind of hard to reproduce similar accuracy.

How do I handle target videos shorter than 10 s?

Hi @hche11, recently I've been trying to download some popular datasets, such as VGG-SS. However, your link doesn't provide the entire dataset but rather a way to obtain it, so first I need to download your VGGSound dataset. I found that some videos in the test set are shorter than 10 s. Maybe we can fix the gap using ffmpeg, but I don't know how, because it isn't clear in the paper "VGGSound: A Large-scale Audio-Visual Dataset". I'd really appreciate it if you could help me; best wishes to you!
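
One common workaround (not prescribed by the paper) is to zero-pad clips shorter than 10 s up to the full length before feeding them to the model. A minimal sketch on a raw waveform represented as a list of samples, assuming a 16 kHz sample rate:

```python
def pad_to_length(samples, target_len):
    """Zero-pad a 1-D sequence of audio samples to target_len; truncate if longer."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

SR = 16_000          # assumed sample rate
TARGET = 10 * SR     # 10-second clips

short_clip = [0.1] * (8 * SR)        # an 8-second clip for illustration
padded = pad_to_length(short_clip, TARGET)
```

Repeating (looping) the clip instead of zero-padding is another option; which is appropriate depends on the downstream model.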

corrupted video?

Hi,

Thank you for releasing VGGSound! I've downloaded the dataset via the link here. It contains 199,176 videos. However, when I use skvideo.io.vread to load the videos into arrays, the videos l6q7shFb8zs_000041 and EktPRpYX9KI_000038 raise this error:

Traceback (most recent call last):
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 271, in _read_frame_data
    assert len(arr) == framesize
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "debug.py", line 194, in <module>
    check_corrupted_video()
  File "debug.py", line 173, in check_corrupted_video
    videodata = skvideo.io.vread(video)
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/io.py", line 148, in vread
    for idx, frame in enumerate(reader.nextFrame()):
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 297, in nextFrame
    yield self._readFrame()
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 281, in _readFrame
    s = self._read_frame_data()
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 275, in _read_frame_data
    raise RuntimeError("%s" % (err1,))
RuntimeError

Does this mean these two videos are corrupted? Thank you for your time and effort. Really looking forward to your reply!
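
One way to find such files up front is to sweep the directory and record which clips fail to decode. A hedged sketch, where `load_video` stands in for whatever reader you use (e.g. `skvideo.io.vread`); the fake loader below is only for illustration:

```python
def find_corrupted(paths, load_video):
    """Return the subset of paths whose loader raises any exception."""
    bad = []
    for p in paths:
        try:
            load_video(p)
        except Exception:
            bad.append(p)
    return bad

# Illustration with a fake loader that rejects one file:
def fake_loader(path):
    if "broken" in path:
        raise RuntimeError("decode failed")

bad = find_corrupted(["ok_000041.mp4", "broken_000038.mp4"], fake_loader)
```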

Why can't I generate predictions with test.py?

I ran the following command:

python test.py --data_path "data/golf_swing.wav" --result_path "predRes" --summaries "models/H.pth.tar" --pool "avgpool"

my sound file: golf_swing.wav
res dir: predRes
pretrained weight: H.pth.tar

number of entries in vggsound.csv do not match the test and train split files

The file vggsound.csv lists 199467 entries. That number does not match the sum of the test and train files. See:

$ wc -l data/train.csv data/test.csv 
 183730 data/train.csv
  15446 data/test.csv
 199176 total
$ wc -l data/vggsound.csv 
199467 data/vggsound.csv

The vggsound.csv file has 291 extra entries. The extra entries are in both the train and test splits:

$ python3 -c 'import csv; [print(x[3]) for x in csv.reader(open("data/vggsound.csv"))]' | sort | uniq -c
  15496 test
 183971 train

I happen to have a copy of the file vggsound.csv as downloaded from the VGG website and these numbers matched.
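
The same tally can be reproduced in Python with `collections.Counter` over the fourth column, shown here on inline sample rows rather than the real file:

```python
import csv
import io
from collections import Counter

# Stand-in for data/vggsound.csv; the IDs and labels are fabricated.
sample = io.StringIO(
    "id1,0,label a,train\n"
    "id2,10,label b,test\n"
    "id3,20,label c,train\n"
)
split_counts = Counter(row[3] for row in csv.reader(sample))
```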

Data Download link does not work

Thank you for your great contributions! Just one small problem with this readme file: the "helpful data download" link does not work. I'd appreciate it if you could update it sometime. Thanks so much!!

Couldn't Evaluate the Predictions Generated

Apologies for the long post and my ignorance.

Generating Predictions:
I downloaded the audio files using the scripts from the mentioned GitHub directory. After that I generated the predictions using the following command.

python3 test.py --summaries "./Weights/vggsound_avgpool.pth.tar" --pool "avgpool" --batch_size=1

While generating the predictions, I got the following error.

  File "/usr/local/lib/python3.6/dist-packages/scipy/signal/spectral.py", line 1757, in _spectral_helper
    raise ValueError('noverlap must be less than nperseg.')

I solved the error by using the default value (nperseg // 8) for noverlap, but then got the following warning.

UserWarning: nperseg = 256 is greater than input length  = 20, using nperseg = 20
  .format(nperseg, input_length))

I had to make the following change to line 108 of test.py.

# Original
aud_o = model(spec.unsqueeze(1).float())
# Changed to
aud_o = model(spec.unsqueeze(1).squeeze(-1).float())

Otherwise it gave the following error.

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 1, 7, 7], but got 5-dimensional input of size [1, 1, 160000, 11, 1] instead
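
To illustrate why the `squeeze(-1)` helps: the trailing singleton axis makes the spectrogram batch 5-D, while the first conv layer expects 4-D input. A shape-only sketch with numpy (the actual code uses torch tensors, whose `unsqueeze`/`squeeze` behave analogously):

```python
import numpy as np

spec = np.zeros((1, 160000, 11, 1))   # spectrogram batch with a stray trailing axis
x = np.expand_dims(spec, 1)           # unsqueeze(1) -> (1, 1, 160000, 11, 1): 5-D, conv2d rejects it
x = np.squeeze(x, -1)                 # squeeze(-1)  -> (1, 1, 160000, 11): 4-D, as expected
```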

Evaluating:
While evaluating the predictions using the script eval.py, it showed the following errors.

RuntimeWarning: invalid value encountered in true_divide
  recall = tps / tps[-1]
Traceback (most recent call last):
  File "eval.py", line 71, in <module>
    main()
  File "eval.py", line 64, in main
    mAUC = np.mean([stat['auc'] for stat in stats])
  File "eval.py", line 64, in <listcomp>
    mAUC = np.mean([stat['auc'] for stat in stats])
KeyError: 'auc'

I didn't download the whole dataset; I downloaded a part of it (2231 files) and generated my own my_test.csv from the downloaded files. The audio files were downloaded in .flac format and then converted to .wav.

Would you please tell me what I am doing wrong? I am new to the Deep Learning research arena, so please pardon my ignorance.

A few bad samples in the dataset: still-frame videos, muted audio, short videos (< 10 s)

I played with the dataset a little and found some flawed examples (please correct me if this is expected).

I am not criticizing the paper but rather sharing my findings with others who might want to use the dataset for their applications 🤗. It is not that significant considering the size of the dataset and the small number of flawed examples (< 5%), and the sets do intersect! However, it might prevent one from facing strange errors when dealing with the dataset.

Pretrained file is damaged

Thanks for your great contribution. I tried to download the "vggsound_avgpool" weights for model H, but the file is damaged. I would appreciate it if you could upload the file again!

Statistics for resolution

Thank you for your contributions.

I wonder if there are any statistics on the quality of the videos (e.g. # of 720p: 10000, # of 480p: 23000, ...).

Thanks.

[Question] Extract ResNet audio embedding layer

I would like to extract the "embedding" layer of the VGGSound network implemented in models.
For example, in the case of ResNet-18 for images, I would take the avgpool layer like this:

model = models.resnet18(pretrained=True)
layer = model._modules.get('avgpool')
self.layer_output_size = 512

Is that correct for VGGSound?
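
A common pattern for this is a forward hook on the pooling layer, which captures its output during a normal forward pass. A hedged sketch on a stand-in trunk (the real model comes from this repo's `models`; the tiny `nn.Sequential` below and its 8-channel width are placeholders, whereas ResNet-18's pooled embedding is 512-dimensional):

```python
import torch
import torch.nn as nn

# Stand-in for the audio ResNet trunk, just to demonstrate the hook mechanics.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

captured = {}

def hook(module, inputs, output):
    # Stash the pooling layer's output each time the model runs.
    captured["embedding"] = output.detach()

# Register on the pooling layer, mirroring the avgpool idea in the question.
model[1].register_forward_hook(hook)

with torch.no_grad():
    model(torch.zeros(1, 1, 32, 32))

embedding = captured["embedding"].flatten(1)   # (batch, channels); 512 channels for ResNet-18
```

The same hook can be attached to the avgpool module of the pretrained VGGSound model once it is loaded from the checkpoint.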

Thank you.
