
VGGSound

Code and results for the ICASSP 2020 paper "VGGSound: A Large-scale Audio-Visual Dataset".

The repo contains the dataset file and our best audio classification model.

Dataset

To download VGGSound, we provide a csv file. For each YouTube video, we provide the YouTube URL, time stamps, the audio label, and the train/test split. Each line in the csv file has the following columns:

# YouTube ID, start seconds, label, train/test split

A helpful link for data download!
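
As a sketch, each line of the csv can be parsed with Python's `csv` module, assuming the four-column format above with no header row (the IDs below are fabricated for illustration):

```python
import csv
import io

# Each line: YouTube ID, start seconds, label, train/test split.
sample = io.StringIO(
    "--abcdefghi,30,playing piano,train\n"
    "--jklmnopqr,120,dog barking,test\n"
)
rows = [
    {"ytid": ytid, "start": int(start), "label": label, "split": split}
    for ytid, start, label, split in csv.reader(sample)
]
train = [r for r in rows if r["split"] == "train"]
```

The YouTube URL for a row can be reconstructed as `https://www.youtube.com/watch?v=<YouTube ID>`.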

Audio classification

We detail the audio classification results here.

  • Pretrain indicates whether the model was pretrained on the YouTube-8M dataset.
  • Dataset (common) means a subset of the dataset containing only the classes (listed here) shared between AudioSet and VGGSound.
  • ASTest is the intersection of the AudioSet and VGGSound test sets.
|   | Model | Aggregation | Pretrain | Finetune/Train | Test | mAP | AUC | d-prime |
|---|-------|-------------|----------|----------------|------|-----|-----|---------|
| A | VGGish | \ | ✔️ | AudioSet (common) | ASTest | 0.286 | 0.899 | 1.803 |
| B | VGGish | \ | ✔️ | VGGSound (common) | ASTest | 0.326 | 0.916 | 1.950 |
| C | VGGish | \ |  | VGGSound (common) | ASTest | 0.301 | 0.910 | 1.900 |
| D | ResNet18 | AveragePool |  | VGGSound (common) | ASTest | 0.328 | 0.923 | 2.024 |
| E | ResNet18 | NetVLAD |  | VGGSound (common) | ASTest | 0.369 | 0.927 | 2.058 |
| F | ResNet18 | AveragePool |  | VGGSound | ASTest | 0.404 | 0.944 | 2.253 |
| G | ResNet18 | NetVLAD |  | VGGSound | ASTest | 0.434 | 0.950 | 2.327 |
| H | ResNet18 | AveragePool |  | VGGSound | VGGSound | 0.516 | 0.968 | 2.627 |
| I | ResNet18 | NetVLAD |  | VGGSound | VGGSound | 0.512 | 0.970 | 2.660 |
| J | ResNet34 | AveragePool |  | VGGSound | ASTest | 0.409 | 0.947 | 2.292 |
| K | ResNet34 | AveragePool |  | VGGSound | VGGSound | 0.529 | 0.972 | 2.703 |
| L | ResNet50 | AveragePool |  | VGGSound | ASTest | 0.412 | 0.949 | 2.309 |
| M | ResNet50 | AveragePool |  | VGGSound | VGGSound | 0.532 | 0.973 | 2.735 |
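
For reference, the d-prime column is the standard transform of AUC used in audio tagging, d' = √2 · Φ⁻¹(AUC), where Φ⁻¹ is the inverse standard-normal CDF. A minimal stdlib sketch, which reproduces the table's values up to rounding:

```python
from math import sqrt
from statistics import NormalDist

def d_prime(auc: float) -> float:
    """d' = sqrt(2) * inverse CDF of the standard normal, evaluated at the AUC."""
    return sqrt(2) * NormalDist().inv_cdf(auc)

# e.g. model G above reports AUC 0.950 alongside d-prime 2.327
g = round(d_prime(0.950), 3)
```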

Environment

  • Python 3.6.8
  • PyTorch 1.3.0

Pretrained model and evaluation

We provide the pretrained models H and I here:

wget http://www.robots.ox.ac.uk/~vgg/data/vggsound/models/H.pth.tar
wget http://www.robots.ox.ac.uk/~vgg/data/vggsound/models/I.pth.tar

To test the model and generate prediction files,

python test.py --data_path "directory to audios/" --result_path "directory to predictions/" --summaries "path to pretrained models" --pool "avgpool"

To evaluate the model performance using the generated prediction files,

python eval.py --result_path "directory to predictions/"

Citation

@InProceedings{Chen20,
  author       = "Honglie Chen and Weidi Xie and Andrea Vedaldi and Andrew Zisserman",
  title        = "VGGSound: A Large-scale Audio-Visual Dataset",
  booktitle    = "International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
  year         = "2020",
}

License

The VGG-Sound dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.

vggsound's People

Contributors

hche11, weidixie


vggsound's Issues

Training settings

Thank you for sharing a wonderful dataset.
Could you explain about the training settings, such as learning rate, weight decay, scheduler, etc.?
It is kind of hard to reproduce similar accuracy.

How do I handle target videos shorter than 10 s?

Hi @hche11, recently I've been trying to download some popular datasets, such as VGG-SS. However, your link doesn't provide the entire dataset but rather a way to obtain it, so first I need to download your VGGSound dataset. I found that some videos in the test set are shorter than 10 s. Maybe we can fix the gap using ffmpeg, but I don't know how, because it isn't clear in the paper "VGGSound: A Large-scale Audio-Visual Dataset". I'd really appreciate it if you could help me; best wishes to you!
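
One common workaround (not prescribed by the paper) is to zero-pad clips shorter than 10 s up to the full length before feeding them to the model. A minimal sketch on a raw waveform represented as a list of samples, assuming a 16 kHz sample rate:

```python
def pad_to_length(samples, target_len):
    """Zero-pad a 1-D sequence of audio samples to target_len; truncate if longer."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

SR = 16_000          # assumed sample rate
TARGET = 10 * SR     # 10-second clips

short_clip = [0.1] * (8 * SR)        # an 8-second clip for illustration
padded = pad_to_length(short_clip, TARGET)
```

Repeating (looping) the clip instead of zero-padding is another option; which is appropriate depends on the downstream model.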

corrupted video?

Hi,

Thank you for releasing VGGSound! I've downloaded the dataset via the link here. It contains 199,176 videos. However, when I use skvideo.io.vread to load the videos into arrays, the videos l6q7shFb8zs_000041 and EktPRpYX9KI_000038 raise this error:

Traceback (most recent call last):
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 271, in _read_frame_data
    assert len(arr) == framesize
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "debug.py", line 194, in <module>
    check_corrupted_video()
  File "debug.py", line 173, in check_corrupted_video
    videodata = skvideo.io.vread(video)
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/io.py", line 148, in vread
    for idx, frame in enumerate(reader.nextFrame()):
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 297, in nextFrame
    yield self._readFrame()
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 281, in _readFrame
    s = self._read_frame_data()
  File "/home/anaconda3/envs/myenv/lib/python3.8/site-packages/skvideo/io/ffmpeg.py", line 275, in _read_frame_data
    raise RuntimeError("%s" % (err1,))
RuntimeError

Does this mean these two videos are corrupted? Thank you for your time and effort. Really looking forward to your reply!
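
One way to find such files up front is to sweep the directory and record which clips fail to decode. A hedged sketch, where `load_video` stands in for whatever reader you use (e.g. `skvideo.io.vread`); the fake loader below is only for illustration:

```python
def find_corrupted(paths, load_video):
    """Return the subset of paths whose loader raises any exception."""
    bad = []
    for p in paths:
        try:
            load_video(p)
        except Exception:
            bad.append(p)
    return bad

# Illustration with a fake loader that rejects one file:
def fake_loader(path):
    if "broken" in path:
        raise RuntimeError("decode failed")

bad = find_corrupted(["ok_000041.mp4", "broken_000038.mp4"], fake_loader)
```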

Why can't I generate predictions with test.py?

I ran the following command:

python test.py --data_path "data/golf_swing.wav" --result_path "predRes" --summaries "models/H.pth.tar" --pool "avgpool"

my sound file: golf_swing.wav
res dir: predRes
pretrained weight: H.pth.tar

number of entries in vggsound.csv do not match the test and train split files

The file vggsound.csv lists 199467 entries. That number does not match the sum of the test and train files. See:

$ wc -l data/train.csv data/test.csv 
 183730 data/train.csv
  15446 data/test.csv
 199176 total
$ wc -l data/vggsound.csv 
199467 data/vggsound.csv

The vggsound.csv file has 291 extra entries. The extra entries are in both the train and test splits:

$ python3 -c 'import csv; [print(x[3]) for x in csv.reader(open("data/vggsound.csv"))]' | sort | uniq -c
  15496 test
 183971 train

I happen to have a copy of the file vggsound.csv as downloaded from the VGG website and these numbers matched.
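
The same tally can be reproduced in Python with `collections.Counter` over the fourth column, shown here on inline sample rows rather than the real file:

```python
import csv
import io
from collections import Counter

# Stand-in for data/vggsound.csv; the IDs and labels are fabricated.
sample = io.StringIO(
    "id1,0,label a,train\n"
    "id2,10,label b,test\n"
    "id3,20,label c,train\n"
)
split_counts = Counter(row[3] for row in csv.reader(sample))
```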

Data Download link does not work

Thank you for your great contributions! Just one small problem with this readme file: the "helpful data download" link does not work. I'd appreciate it if you could update it sometime. Thanks so much!!

Couldn't Evaluate the Predictions Generated

Apologies for the long post and my ignorance.

Generating Predictions:
I downloaded the audio files using the scripts from the mentioned GitHub directory. After that I generated the predictions using the following command.

python3 test.py --summaries "./Weights/vggsound_avgpool.pth.tar" --pool "avgpool" --batch_size=1

While generating the predictions, I got the following error.

  File "/usr/local/lib/python3.6/dist-packages/scipy/signal/spectral.py", line 1757, in _spectral_helper
    raise ValueError('noverlap must be less than nperseg.')

I solved the error by using the default value (nperseg // 8) for noverlap, but then got the following warning.

UserWarning: nperseg = 256 is greater than input length  = 20, using nperseg = 20
  .format(nperseg, input_length))

I had to make the following change to line 108 of test.py.

# Original
aud_o = model(spec.unsqueeze(1).float())
# Changed to
aud_o = model(spec.unsqueeze(1).squeeze(-1).float())

Otherwise it gave the following error.

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [64, 1, 7, 7], but got 5-dimensional input of size [1, 1, 160000, 11, 1] instead
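
To illustrate why the `squeeze(-1)` helps: the trailing singleton axis makes the spectrogram batch 5-D, while the first conv layer expects 4-D input. A shape-only sketch with numpy (the actual code uses torch tensors, whose `unsqueeze`/`squeeze` behave analogously):

```python
import numpy as np

spec = np.zeros((1, 160000, 11, 1))   # spectrogram batch with a stray trailing axis
x = np.expand_dims(spec, 1)           # unsqueeze(1) -> (1, 1, 160000, 11, 1): 5-D, conv2d rejects it
x = np.squeeze(x, -1)                 # squeeze(-1)  -> (1, 1, 160000, 11): 4-D, as expected
```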

Evaluating:
While evaluating the predictions using the script eval.py, it showed the following errors.

RuntimeWarning: invalid value encountered in true_divide
  recall = tps / tps[-1]
Traceback (most recent call last):
  File "eval.py", line 71, in <module>
    main()
  File "eval.py", line 64, in main
    mAUC = np.mean([stat['auc'] for stat in stats])
  File "eval.py", line 64, in <listcomp>
    mAUC = np.mean([stat['auc'] for stat in stats])
KeyError: 'auc'

I didn't download the whole dataset; I downloaded a part of it (2231 files) and generated my own my_test.csv from the downloaded files. The audio files were downloaded in .flac format and then converted to .wav.

Would you please tell me what I am doing wrong? I am new to the Deep Learning research arena, so please pardon my ignorance.

A few bad samples in the dataset: still-frame videos, muted audio, short videos (< 10 s)

I played with the dataset a little and found some flawed examples (please correct me if this is expected).

I am not criticizing the paper but rather sharing my findings with others who might want to use the dataset for their applications 🤗. It is not that significant considering the size of the dataset and the small number of flawed examples (< 5%), and the sets do intersect! However, it might prevent one from facing strange errors when dealing with the dataset.

Pretrained file is damaged

Thanks for your great contribution. I tried to download the "vggsound_avgpool" weights for model H, but the file is damaged. I would appreciate it if you could upload the file again!

Statistics for resolution

Thank you for your contributions.

I wonder if there are any statistics on the quality of the videos (e.g. # of 720p: 10000, # of 480p: 23000, ...).

Thanks.

[Question] Extract ResNet audio embedding layer

I would like to extract the "embedding" layer of the VGGSound network implemented in models.
For example, in the case of ResNet-18 for images, I would take the avgpool layer like this:

model = models.resnet18(pretrained=True)
layer = model._modules.get('avgpool')
self.layer_output_size = 512

Is that correct for VGGSound?
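
A common pattern for this is a forward hook on the pooling layer, which captures its output during a normal forward pass. A hedged sketch on a stand-in trunk (the real model comes from this repo's `models`; the tiny `nn.Sequential` below and its 8-channel width are placeholders, whereas ResNet-18's pooled embedding is 512-dimensional):

```python
import torch
import torch.nn as nn

# Stand-in for the audio ResNet trunk, just to demonstrate the hook mechanics.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

captured = {}

def hook(module, inputs, output):
    # Stash the pooling layer's output each time the model runs.
    captured["embedding"] = output.detach()

# Register on the pooling layer, mirroring the avgpool idea in the question.
model[1].register_forward_hook(hook)

with torch.no_grad():
    model(torch.zeros(1, 1, 32, 32))

embedding = captured["embedding"].flatten(1)   # (batch, channels); 512 channels for ResNet-18
```

The same hook can be attached to the avgpool module of the pretrained VGGSound model once it is loaded from the checkpoint.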

Thank you.
