
SoundNet-tensorflow

TensorFlow implementation of "SoundNet" that learns rich natural sound representations.

Code for the paper "SoundNet: Learning Sound Representations from Unlabeled Video" by Yusuf Aytar, Carl Vondrick, and Antonio Torralba, NIPS 2016.

(Figure from soundnet.)

Prerequisites

  • Linux
  • NVIDIA GPU + CUDA 8.0 + cuDNN v5.1
  • Python 2.7 with numpy or Python 3.5
  • TensorFlow 1.0.0 (up to 1.3.0)
  • librosa

Getting Started

  • Clone this repo:
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
  • Pretrained Model

I provide pre-trained models ported from soundnet. You can download the 8-layer model here. Please place it at ./models/sound8.npy.
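To verify the download, you can try loading the file (a quick sketch; the allow_pickle and encoding flags are assumptions based on the file being a Python 2 pickled dict of layer weights):

import numpy as np

# Sanity check (sketch): sound8.npy is assumed to be a pickled dict mapping
# layer names to parameter arrays, saved from Python 2 (hence encoding='latin1').
weights = np.load('./models/sound8.npy', allow_pickle=True, encoding='latin1').item()
print(len(weights), 'layers:', sorted(weights.keys()))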

  • Data

Prepare your input mp3 files and place them under ./data/

Generate an input file list (txt) and place it under ./, e.g.:

./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...

Then follow the steps in the Feature Extraction section below.
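If you have many files, the list can be generated with a short script (a sketch; the ./data folder and the file_list.txt name are just examples following this README):

import glob

# Collect every mp3 under ./data/ and write one path per line, matching the
# file-list format passed to extract_feat.py via -t. The name file_list.txt
# is arbitrary.
paths = sorted(glob.glob('./data/*.mp3'))
with open('./file_list.txt', 'w') as f:
    f.write('\n'.join(paths) + '\n')
print('Wrote', len(paths), 'paths to ./file_list.txt')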

  • NOTE

If you find that audio with a start-offset value in FFmpeg causes a large difference between torch audio and librosa, please convert it with the following command:

sox {input.mp3} {output.mp3} trim 0

After this, the result should be much better.
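To apply the same fix to a whole folder, one option is to call sox from Python (a sketch; it assumes sox is installed and on your PATH, and writes trimmed copies to a hypothetical ./data_trimmed/ folder so the originals stay untouched):

import glob
import os
import subprocess

# Re-encode every mp3 with `sox input output trim 0` to drop any start offset,
# saving the results in a separate folder.
if not os.path.isdir('./data_trimmed'):
    os.makedirs('./data_trimmed')
for src in sorted(glob.glob('./data/*.mp3')):
    dst = os.path.join('./data_trimmed', os.path.basename(src))
    subprocess.check_call(['sox', src, dst, 'trim', '0'])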

Demo

For the demo, follow these steps:

i) Download the converted npy file demo.npy and place it under ./data/

ii) Extract multiple features from the pretrained model using a sound track loaded with torch lua audio (the sound track is equivalent to the Torch version):

python extract_feat.py -m {start layer number} -x {end layer number} -s

Then you can compare the outputs with the Torch ones.
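One simple way to do the comparison is to load both feature files with NumPy and check the numerical difference (a sketch; tf_fea05.npy follows the naming used by extract_feat.py, while torch_fea05.npy stands for a hypothetical dump from the Torch code, so adjust both paths to your setup):

import numpy as np

# Compare layer-5 features from this repo against a Torch-side dump.
tf_feat = np.load('./sound_out/tf_fea05.npy')
th_feat = np.load('./sound_out/torch_fea05.npy')   # hypothetical Torch export
print('shapes:', tf_feat.shape, th_feat.shape)
print('max abs diff:', np.abs(tf_feat - th_feat).max())
print('allclose:', np.allclose(tf_feat, th_feat, atol=1e-4))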

Feature Extraction

Minimum example

i) Download the input file demo.mp3 and place it under ./data/

ii) Prepare a file list in txt format (demo.txt) that includes the input mp3 file(s) and place it under ./

./data/demo.mp3

iii) Extract features from the raw waveform listed in demo.txt (make sure the demo mp3 is at ./data/demo.mp3):

python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt

More options

To extract multiple features from a pretrained model with a downloaded mp3 dataset:

python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract

e.g., extract layer 4 to layer 17 and save the outputs as ./sound_out/tf_fea%02d.npy:

python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract

More details are in:

python extract_feat.py -h
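The extracted features are plain NumPy arrays, so inspecting them afterwards is straightforward (a sketch using the sound_out folder from the example above):

import glob
import numpy as np

# List every layer file written by extract_feat.py and print its shape.
for path in sorted(glob.glob('./sound_out/tf_fea*.npy')):
    feat = np.load(path)
    print(path, feat.shape, feat.dtype)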

Finetuning

To train from an existing model:

python main.py 

Training

To train from scratch:

python main.py -p train

To extract features:

python main.py -p extract -m {start layer number} -x {end layer number} -s

More details are in:

python main.py -h

TODOs

  • Change the audio loader to the soundnet format
  • Make it compatible with Python 3
  • Fix Batch Norm behaviour that differs from Torch
  • Fix the conv8 padding issue in the training phase
  • Change all config into tf.app.flags
  • Change the dummy scene and object distributions to useful placeholders
  • Add a sound and feature loader for the Data section

Known issues

  • Loaded audio length is not consistent between torch7 audio and librosa. Here is the issue
  • Training with short audio will make conv8 complain that the output size would be negative

FAQs

  • Why is my loaded sound wave different between torch7 audio and librosa? Here is my Wiki

Acknowledgments

Code is ported from soundnet, and the Torch7-TensorFlow loader is from tf_videogan. Thanks for their excellent work!

Author

Hou-Ning Hu / @eborboihuc


soundnet-tensorflow's Issues

Extracting features in pool5

I have read in the paper that the best layer for feature extraction is 'pool5'. However, the feature size in that layer is h x w x 256.
Any idea how that 3D array has to be processed for an SVM, as described in the paper?

Using our pre-trained model, you can extract discriminative features for natural sound recognition. In our experiments, pool5 seems to work the best with a linear SVM
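One common way to feed such a 3-D feature map to a linear SVM is to average it over everything except the channel axis, giving one 256-d vector per clip (a sketch, not necessarily the paper's exact recipe; scikit-learn's LinearSVC and the feature/label paths below are assumptions):

import glob
import numpy as np
from sklearn.svm import LinearSVC

def clip_vector(pool5_feat):
    # Collapse all axes except the 256 channels by averaging.
    feat = np.asarray(pool5_feat)
    return feat.reshape(-1, feat.shape[-1]).mean(axis=0)

# Hypothetical layout: one pool5 .npy per clip plus a matching label file.
paths = sorted(glob.glob('./sound_out/pool5/*.npy'))
labels = np.loadtxt('./labels.txt', dtype=int)

X = np.stack([clip_vector(np.load(p)) for p in paths])
clf = LinearSVC(C=1.0).fit(X, labels)
print('training accuracy:', clf.score(X, labels))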

audio length

Thanks for your hard work on audio embedding. When I extract features on my own dataset, I wonder whether the pretrained model only accepts audio longer than about 8 or 9 seconds. For example, if I use audio that is only 3 to 5 seconds long, it is not accepted and raises an error like TypeError: 'float' object cannot be interpreted as an integer.

Problem about division in python3

Thanks for the efforts!
However, there seems to be a minor problem with Python 3.
In line 59 of util.py,
raw_audio = np.tile(raw_audio, length/raw_audio.shape[0] + 1)
should be
raw_audio = np.tile(raw_audio, length//raw_audio.shape[0] + 1)
since Python 3's / is true division and does not automatically convert the result to an int.

numpy tile parameter bug

Line 59 in util.py
raw_audio = np.tile(raw_audio, length/raw_audio.shape[0] + 1)
Should be
raw_audio = np.tile(raw_audio, length//raw_audio.shape[0] + 1)
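In context, the corrected padding logic looks roughly like this (a sketch of what line 59 of util.py is doing, assuming raw_audio is a 1-D NumPy array and length is the target number of samples; it also explains the short-audio error above, since short clips hit exactly this tiling path):

import numpy as np

def pad_or_trim(raw_audio, length):
    # Repeat the clip enough times to cover `length` samples, then cut.
    # Floor division (//) keeps np.tile's reps argument an int on Python 3.
    if raw_audio.shape[0] < length:
        raw_audio = np.tile(raw_audio, length // raw_audio.shape[0] + 1)
    return raw_audio[:length]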

Trivial Python 3 incompatibilities

I think it would be a good idea to use Python 3 syntax for print statements; it's a minor change that would probably make this code Python 3 compatible.

pytorch pre-trained model

Thank you for your great work!
Do you have a pre-trained model for a PyTorch implementation? Or is it possible to convert the .npy pre-trained model to .pth?
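There is no official PyTorch checkpoint here, but dumping the raw arrays into a .pth file is simple in principle (a sketch; it assumes sound8.npy is a pickled dict of per-layer parameter arrays and only re-saves them with torch.save — mapping them onto a real PyTorch module's state_dict is a separate step):

import numpy as np
import torch

# Load the ported weights (assumed: layer name -> dict of parameter arrays)
# and re-save every array as a torch tensor under a flattened "layer.param" key.
weights = np.load('./models/sound8.npy', allow_pickle=True, encoding='latin1').item()
state = {}
for layer, params in weights.items():
    if isinstance(params, dict):
        for pname, arr in params.items():
            state['{}.{}'.format(layer, pname)] = torch.from_numpy(np.asarray(arr))
    else:
        state[layer] = torch.from_numpy(np.asarray(params))
torch.save(state, './models/sound8_raw.pth')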
