
samplernn_iclr2017's Introduction

SampleRNN

Code accompanying the paper SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (arXiv:1612.07837). Samples are available here.

Unrolled model

Dependencies

Extensively tested with:

  • cuDNN 5105
  • Python 2.7.12
  • Numpy 1.11.1
  • Theano 0.8.2 (0.9 for WaveNet re-implementation)
  • Lasagne 0.2.dev1

Datasets

The music dataset was created from all 32 of Beethoven's piano sonatas, publicly available on archive.org. datasets/music contains scripts to preprocess and build this dataset. It is also available here for download. Extract the tar file and put all the numpy files in the datasets/music directory.
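A quick way to sanity-check the extracted files (a minimal sketch; the music_*.npy names follow the files referenced by the preprocessing scripts):

import numpy as np

# Verify the three splits after extracting the tar into datasets/music.
for split in ('train', 'valid', 'test'):
    arr = np.load('datasets/music/music_{}.npy'.format(split))
    print('{}: shape {}, dtype {}'.format(split, arr.shape, arr.dtype))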

Training

To train a model on an existing dataset with accelerated GPU processing, run the following commands from the root of the sampleRNN_ICLR2017 folder; they correspond to the best set of hyperparameters found.

Mission control center:

$ pwd
/u/mehris/sampleRNN_ICLR2017

SampleRNN (2-tier)

$ python models/two_tier/two_tier.py -h
usage: two_tier.py [-h] [--exp EXP] --n_frames N_FRAMES --frame_size
                   FRAME_SIZE --weight_norm WEIGHT_NORM --emb_size EMB_SIZE
                   --skip_conn SKIP_CONN --dim DIM --n_rnn {1,2,3,4,5}
                   --rnn_type {LSTM,GRU} --learn_h0 LEARN_H0 --q_levels
                   Q_LEVELS --q_type {linear,a-law,mu-law} --which_set
                   {ONOM,BLIZZ,MUSIC} --batch_size {64,128,256} [--debug]
                   [--resume]

two_tier.py No default value! Indicate every argument.

optional arguments:
  -h, --help            show this help message and exit
  --exp EXP             Experiment name
  --n_frames N_FRAMES   How many "frames" to include in each Truncated BPTT
                        pass
  --frame_size FRAME_SIZE
                        How many samples per frame
  --weight_norm WEIGHT_NORM
                        Adding learnable weight normalization to all the
                        linear layers (except for the embedding layer)
  --emb_size EMB_SIZE   Size of embedding layer (0 to disable)
  --skip_conn SKIP_CONN
                        Add skip connections to RNN
  --dim DIM             Dimension of RNN and MLPs
  --n_rnn {1,2,3,4,5}   Number of layers in the stacked RNN
  --rnn_type {LSTM,GRU}
                        GRU or LSTM
  --learn_h0 LEARN_H0   Whether to learn the initial state of RNN
  --q_levels Q_LEVELS   Number of bins for quantization of audio samples.
                        Should be 256 for mu-law.
  --q_type {linear,a-law,mu-law}
                        Quantization in linear-scale, a-law companding, or mu-
                        law companding. With mu-/a-law, quantization levels
                        should be set to 256
  --which_set {ONOM,BLIZZ,MUSIC}
                        ONOM, BLIZZ, or MUSIC
  --batch_size {64,128,256}
                        size of mini-batch
  --debug               Debug mode
  --resume              Resume the same model from the last checkpoint. Order
                        of params is important. [for now]

To run:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
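As a side note on --q_type and --q_levels: a minimal sketch of standard mu-law companding (the textbook formula, not code from this repo) shows why 256 levels are expected in that mode:

import numpy as np

def mu_law_quantize(x, q_levels=256):
    # Standard mu-law companding followed by uniform quantization into
    # q_levels bins; with q_levels=256 (mu=255) this is the classic 8-bit
    # telephony companding, hence the "should be 256" note in the help text.
    mu = q_levels - 1
    x = np.clip(x, -1.0, 1.0)  # audio assumed in [-1, 1]
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1.) / 2. * mu + 0.5).astype('int32')

# e.g. mu_law_quantize(np.array([-1.0, 0.0, 1.0])) -> [0, 128, 255]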

SampleRNN (3-tier)

$ python models/three_tier/three_tier.py -h
usage: three_tier.py [-h] [--exp EXP] --seq_len SEQ_LEN --big_frame_size
                     BIG_FRAME_SIZE --frame_size FRAME_SIZE --weight_norm
                     WEIGHT_NORM --emb_size EMB_SIZE --skip_conn SKIP_CONN
                     --dim DIM --n_rnn {1,2,3,4,5} --rnn_type {LSTM,GRU}
                     --learn_h0 LEARN_H0 --q_levels Q_LEVELS --q_type
                     {linear,a-law,mu-law} --which_set {ONOM,BLIZZ,MUSIC}
                     --batch_size {64,128,256} [--debug] [--resume]

three_tier.py No default value! Indicate every argument.

optional arguments:
  -h, --help            show this help message and exit
  --exp EXP             Experiment name
  --seq_len SEQ_LEN     How many samples to include in each Truncated BPTT
                        pass
  --big_frame_size BIG_FRAME_SIZE
                        How many samples per big frame in tier 3
  --frame_size FRAME_SIZE
                        How many samples per frame in tier 2
  --weight_norm WEIGHT_NORM
                        Adding learnable weight normalization to all the
                        linear layers (except for the embedding layer)
  --emb_size EMB_SIZE   Size of embedding layer (> 0)
  --skip_conn SKIP_CONN
                        Add skip connections to RNN
  --dim DIM             Dimension of RNN and MLPs
  --n_rnn {1,2,3,4,5}   Number of layers in the stacked RNN
  --rnn_type {LSTM,GRU}
                        GRU or LSTM
  --learn_h0 LEARN_H0   Whether to learn the initial state of RNN
  --q_levels Q_LEVELS   Number of bins for quantization of audio samples.
                        Should be 256 for mu-law.
  --q_type {linear,a-law,mu-law}
                        Quantization in linear-scale, a-law companding, or mu-
                        law companding. With mu-/a-law, quantization levels
                        should be set to 256
  --which_set {ONOM,BLIZZ,MUSIC}
                        ONOM, BLIZZ, or MUSIC
  --batch_size {64,128,256}
                        size of mini-batch
  --debug               Debug mode
  --resume              Resume the same model from the last checkpoint. Order
                        of params is important. [for now]

To run:

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/three_tier/three_tier.py --exp BEST_3TIER --seq_len 512 --big_frame_size 8 --frame_size 2 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 1 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC

Reference

If you are using this code, please cite the paper:

@article{mehri2016samplernn,
  Author  = {Soroush Mehri and Kundan Kumar and Ishaan Gulrajani and Rithesh Kumar and Shubham Jain and Jose Sotelo and Aaron Courville and Yoshua Bengio},
  Title   = {SampleRNN: An Unconditional End-to-End Neural Audio Generation Model},
  Year    = {2016},
  Journal = {arXiv preprint arXiv:1612.07837},
}

Torch implementation

Thanks to Richard Assar, a Torch implementation is now available:

https://github.com/richardassar/SampleRNN_torch

Miscellaneous

If you need help, or have an interesting related project or results, please don't hesitate to contact us.

samplernn_iclr2017's People

Contributors

kundan2510, richardassar, soroushmehr


samplernn_iclr2017's Issues

Arbitrary integer index to training, validation and test arrays?

In /datasets/music/_2npy.py, on lines 33-35, we create the numpy files to be used as datasets and specify the length of each array as follows:

np.save('music_train.npy', np_arr[:-2*256])
np.save('music_valid.npy', np_arr[-2*256:-256])
np.save('music_test.npy', np_arr[-256:])

The problem is that, when trying a different dataset that yields an array shorter than 512 rows, we end up creating an empty array for the music_train.npy file.

In the paper a partition like 88:6:6 is suggested for the three sets; couldn't we do this with something like numpy.split (see the sketch below) to ensure that, no matter the size of the array, we still get the correct partition? Or am I missing something that requires it to be hardcoded as above?
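For illustration, a proportional split along those lines might look like this (a sketch only; the actual loader may assume fixed 256-row validation/test sets):

import numpy as np

np_arr = np.random.randn(300, 1600)  # stand-in for the array built in _2npy.py

def split_dataset(np_arr, ratios=(0.88, 0.06, 0.06)):
    # Split along axis 0 by proportion, so even a short dataset never
    # yields an empty training array.
    n_train = int(len(np_arr) * ratios[0])
    n_valid = int(len(np_arr) * ratios[1])
    return (np_arr[:n_train],
            np_arr[n_train:n_train + n_valid],
            np_arr[n_train + n_valid:])

train, valid, test = split_dataset(np_arr)
np.save('music_train.npy', train)
np.save('music_valid.npy', valid)
np.save('music_test.npy', test)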

Noise segments

Hi, and thanks for sharing this wonderful model.
I must say, it is probably the best model I've tried so far for sample-by-sample generation.

I do have a question, though: I constantly end up with noise bursts or noise segments in my generated audio; sometimes they cover the whole generation, sometimes they appear only at a certain point.

This does not seem to improve with epochs (after a couple of days of training, I still get the same noise bursts), and it happens across different datasets (mostly classical music). Oddly enough, it seems to be less of a problem with noisy datasets, such as rain or water sounds :)

Does anyone have an idea of how to avoid or limit this issue?
Is there a parameter I should fine tune for this?

Thanks again,
Daniele

SampleRNN performance

On a Tesla K80, how much time does an epoch take? For the samples posted on SoundCloud, how many epochs were used? What hardware was used?

SampleRNN (2-tier) example (from readme.md) won't run

Interesting paper and approach; it would be very exciting to compare results with WaveNet. However, the test run suggested in readme.md (as SampleRNN (2-tier)) fails with the following error:

MemoryError: Error allocating 2147483648 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).

I am familiar with TensorFlow (from Google) but not Theano, so I have not looked at the code yet; I'm hoping it might be something easy for you to spot and point me to a quick solution.

The GPU I have is a GeForce GTX 1080 with 8 GB of memory. The initial CNMeM is set to 50% in the .theanorc file ([lib] cnmem=0.5):

Using gpu device 0: GeForce GTX 1080
(CNMeM is enabled with initial size: 50.0% of memory, cuDNN 5110)

During the initial run, the nvidia-smi tool shows an initial GPU memory allocation of ~4 GB; it stays at that level, and the training run fails moments later with the memory allocation error above.

The training run is inside a docker image, with the following pertinent version info:
nvcc --version
Cuda compilation tools, release 8.0, V8.0.61
Python:
sys.version_info sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
Numpy:
.version 1.11.0
Theano:
.version 0.9.0rc2.dev-19540d4e3064fe0dc0e1281f517bad0f355e46a2
pip show lasagne
Metadata-Version: 2.0
Name: Lasagne
Version: 0.1

Thanks!

Teacher forcing

The paper mentions teacher forcing. I don't see it in the code. Am I mistaken?

EDIT: I had "teacher forcing" mixed up in my head with "scheduled sampling". Indeed, SampleRNN uses teacher forcing.

Replicating dataset splits?

I want to compare the results in this paper on the three datasets to my own approach. How do I make sure that I have the exact same data including the same training/validation/test split?

For example, for the Blizzard dataset I couldn't see an "official" split, so I suppose you used your own for the paper? In that case, where can I find details on it so I can split it the same way?

Similarly for the Piano dataset, I am wondering how to get the dataset in full form (full song files, not 8-second chunks) and partition it the same way you have done.

Otherwise it's very hard to compare the results to any other method.

Example of music generated using mostly default parameters (tier-2), at epoch9 and epoch11

In case folks are interested, I've uploaded several of the generated wav files at epoch=9 and epoch=11.

Please browse to: https://soundcloud.com/user-637335781

Thanks to kundan2510 for answering my earlier question about the mem alloc error and for suggesting I reduce the batch size.

The training run was as follows:
THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 64 --weight_norm True --learn_h0 True --which_set MUSIC

GPU: single 8 GB GTX 1080 with 2560 CUDA cores

Epoch 9 took about 24 hours, and getting to epoch 11 took an additional 6 hours. The training run was aborted at the start of epoch 12.

This is a very exciting project/code you guys have! wow!

-Best regards!

StopIteration

Any pointers for this? I'm using the two_tier command as provided in the README and have preprocessed my own wavs with the scripts.

Wall clock time spent before training started: 0.01h
Training!
0
Traceback (most recent call last):
  File "models/two_tier/two_tier.py", line 615, in <module>
    mini_batch = tr_feeder.next()
StopIteration

Using My own Dataset and Training versus Generating

I wanted to know how I could use this to train on my own dataset, for both SampleRNN and the WaveNet re-implementation. I read the question about this under issues, but I didn't really understand what specifically I should do to use my own dataset. I also wanted to know whether, using this model, I am able to reconstruct a single audio file after training the model on a corpus of similar audio. If so, can you please tell me how?

Trouble with this repo: is any repo kept up to date?

I was able to make it work on CPU, but making it work on GPU has been simply a nightmare. This repo requires CUDA 9.0, which is not supported by Ubuntu 18.04, so when I realized it wasn't going to work there, I started with a clean OS install of Ubuntu 16.04 just to make this repo work with CUDA.

While the CUDA installation went well and worked, it was then Theano 0.8.2's turn to give me some very weird error messages, such as: Exception: The nvidia driver version installed with this OS does not give good results for reduction. Installing the nvidia driver available on the same download page as the cuda package will fix the problem. That problem seems common when looking it up online, but Theano, and Theano 0.8.2 in particular, is no longer supported anyway.

Besides the OS mismatch and the weird Theano error messages, my plan for my master's thesis was to try new ideas starting from this code; since it uses old libraries and is no longer supported by MILA, that might not be the best starting point.

So, I'm giving up on this repo. But I'm definitely not giving up on SampleRNN, as my thesis is on music generation and SampleRNN seems ideal for my case. Does any repo implement SampleRNN or something very similar using a recent version of TensorFlow or PyTorch, and is it updated regularly? Given the success of this model, I find it strange that there is no maintained git repository.

Thank you for your attention, and sorry for the displayed bitterness; I've been trying to make this work for days and I'm exhausted. Let me conclude by thanking the researchers for this new model and for making it available online.

Wavenet parameters

Hi,

As you mentioned in your paper, you re-implemented WaveNet; could you share some of the parameters you used?

such as:

  1. How many skip channels did you use?
  2. How many dilation channels?
  3. Are the skip-channel values of each convolution layer summed together directly, without weights?

Thanks very much.

speed of generating speech samples

I found that SampleRNN needs to be run in parallel to get fast generation speed. It takes only about 500 seconds to generate 200 utterances, each 8 seconds long. But it is very time-consuming to run generation for only one sentence: more than 40 seconds for 1 second of speech. It seems it's not faster than WaveNet. Does anyone have ideas on speeding it up?

Pretty amazing results from a training run on classical guitar music (single instrument), at epoch 5 & 7

https://soundcloud.com/user-637335781/sets/training-1-on-classical-guitar-music-single-instrument-at-epoch-5-7

Subjectively speaking, sound quality is better than the results I got from training on the piano set.

WAV files generated at epoch 5 (~15k training samples) and epoch 7 (~20k training samples)

Single GPU: 8 GB GTX 1080 with 2560 CUDA cores
End of Epoch 7 at about 8 hours

Validation! Done!

>>> Best validation cost of 1.78753066063 reached. Testing! Done!
>>> test cost:1.8329474926	total time:60.4850599766
epoch:7	total iters:20498	wall clock time:7.04h
>>> Lowest valid cost:1.78753066063	 Corresponding test cost:1.8329474926
	train cost:1.7714	total time:6.00h	per iter:1.054s
	valid cost:1.7875	total time:0.02h
	test  cost:1.8329	total time:0.02h
Saving params! Done!

Run command:
THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 64 --weight_norm True --learn_h0 True --which_set MUSIC

Training was done on several guitar passages from YouTube; total audio duration ~4 hours.


Unable to run on CPU (vs GPU)

Although I can run on the GPU, I like to play around with smaller files on the CPU, but I can't figure out why I get this Theano error, and I'm wondering if someone might know or could direct me to the next step in the investigation.

Anyhow, in a Theano docker container, running training, I don't get very far due to this error:

File "./models/two_tier/two_tier.py", line 394, in <module>
    prev_samples = T.nnet.neighbours.images2neibs(prev_samples, (1, FRAME_SIZE), neib_step=(1, 1), mode='valid')
AttributeError: 'module' object has no attribute 'neighbours'

I believe I have all the prerequisites installed, latest Theano, etc.:

Python:
	sys.version_info	sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0)
Numpy:
	.__version__	1.8.2
Theano:
	.__version__	0.9.0dev5.dev-598991d76d2764ab7411c5dc5a2dcc59e3ef55ea

And in Python I can see that the nnet module DOES have neighbours:

python:
import theano
help(theano.tensor.nnet)

Help on package theano.tensor.nnet in theano.tensor:
NAME
    theano.tensor.nnet
FILE
    /usr/local/lib/python2.7/dist-packages/theano/tensor/nnet/__init__.py
PACKAGE CONTENTS
    Conv3D
    ConvGrad3D
    ConvTransp3D
    abstract_conv
    blocksparse
    bn
    conv
    conv3d2d
    corr
    corr3d
    neighbours
    nnet
    opt
    sigm
    tests (package)

Thoughts?

Trying to train using a new dataset, test_cost undefined

I'm trying to train a model using a different dataset than the Beethoven sonatas, and I'm having a bit of trouble when running:

$ python -u models/three_tier/three_tier.py --exp BEST_3TIER --seq_len 512 --big_frame_size 8 --frame_size 2 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 1 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC

Basically, after the first batch has been fully iterated through, I get an issue with test_cost being undefined. I've tried with the two-tier model as well, but it looks like the code is the same in both models, that is:

        if valid_cost < lowest_valid_cost:
            lowest_valid_cost = valid_cost
            print "\n>>> Best validation cost of {} reached. Testing!"\
                    .format(valid_cost),
            test_cost, test_time = monitor(test_feeder)

The first time test_cost is defined is in the code above; however, if the conditional isn't true, we will still try to use test_cost on line 860 (in three_tier) like this (and I believe, after hacking it a bit, that we try to use it again further on in the code):

        print_info = print_info.format(epoch,
                                       total_iters,
                                       (time()-exp_start)/3600,
                                       ....

                                       test_cost,
                                       test_time/3600)

This means we're referencing an undefined test_cost. I'm not quite sure of the role of test_cost in the first batch, so I cannot really modify the code to get it working.
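One possible workaround (an untested assumption about the surrounding loop, not an author-provided patch) is to give the test metrics sentinel values before the training loop, so the status line can always print:

# Hypothetical placement: before the main training loop in
# two_tier.py / three_tier.py. Until the first "Best validation cost"
# branch runs, report placeholders instead of an undefined variable.
test_cost, test_time = float('inf'), 0.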

preprocessing data

Hi,
I think this line should be:
os.system('ffmpeg -ss {} -t 8 -i {}/preprocess_all_audio.wav -ac 1 -ab 16k -ar 16000 {}/p{}.flac'.format(i*8, OUTPUT_DIR, OUTPUT_DIR, i))

Otherwise, you only use the first (int(length)//8 - 1) * 8 seconds of your training data?
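For context, a minimal sketch of the chunking loop this line appears to belong to (OUTPUT_DIR and length mirror the names in the line above; the loop structure itself is an assumption):

import os

OUTPUT_DIR = 'datasets/music/parts'  # hypothetical output directory
length = 3600.0  # total duration of preprocess_all_audio.wav, in seconds

# Split the long wav into 8-second, 16 kHz mono FLAC chunks.
for i in range(int(length) // 8):
    os.system('ffmpeg -ss {} -t 8 -i {}/preprocess_all_audio.wav '
              '-ac 1 -ab 16k -ar 16000 {}/p{}.flac'
              .format(i * 8, OUTPUT_DIR, OUTPUT_DIR, i))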

Bad amplitude normalization

Problem

Every batch, the amplitudes are min-max normalized within that batch.

def __batch_quantize(data, q_levels, q_type):
    """
    One of 'linear', 'a-law', 'mu-law' for q_type.
    """
    data = data.astype('float64')
    data = __normalize(data)

. . . 

def __normalize(data):
    """To range [0., 1.]"""
    data -= data.min(axis=1)[:, None]
    data /= data.max(axis=1)[:, None]
    return data

This is bad for three reasons:

First reason: DC offset. The normalization subtracts the minimum and divides by the maximum. If the minimum and maximum peaks differ in magnitude, silence is no longer the middle value, so a DC offset is introduced into the audio.

Second reason: each batch has different peaks, so each batch will have a different quantization value for silence.

Third reason: dynamics. Say part of my dataset is soft, part is loud, and part is transitions between soft and loud. If a batch happens to contain only soft sounds, those sounds will be normalized up to loud, interfering with SampleRNN's ability to learn transitions between loud and soft.

Solution

Don't normalize amplitudes per batch.

Take this line out

data = __normalize(data)

I think the only acceptable amplitude normalization would be over the entire dataset, and you could do that (with ffmpeg) when creating the dataset.
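A minimal sketch of that alternative, normalizing once over the whole array before the .npy files are written (where exactly to hook this in is an assumption):

import numpy as np

def normalize_dataset(np_arr):
    # Scale the whole dataset to [0., 1.] with one global min/max, so
    # silence maps to the same quantization bin in every batch and no
    # per-batch DC offset is introduced.
    np_arr = np_arr.astype('float64')
    np_arr -= np_arr.min()
    np_arr /= np_arr.max()
    return np_arr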

To continue test wav?

First: your work is absolutely awesome! The fact that the model is capable of generating a reasonably-sounding signal for many seconds (and hundreds of thousands of samples) is awesome.

Second: my question:
I have trained models on the music dataset successfully. I would like to see (hear) a continuation of a given wav file, generated by the model; basically to find out how well and for how long the model is able to continue an input sound.
Please give me a hint how to do that.
Thanks!
Lukas

creating a dataset

Hi,

Thank you for sharing this!

I am confused as to how to create a dataset out of a collection of music I would like to train with (what file format, length, bit rate, etc. it needs to be), and how to initiate the training.

Thanks and best regards,

Training on my own voice wav files

Is there a quick/dirty way to use my own wav files for training?
It seems to me that somehow I need to create the _train.npy, _valid.npy and _test.npy files.
thanks!
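A quick-and-dirty sketch of exactly that (assumptions: mono 16 kHz wavs, 8-second chunks, and the hard-coded 256-row splits from datasets/music/_2npy.py):

import numpy as np
from scipy.io import wavfile

CHUNK = 8 * 16000  # 8-second chunks at 16 kHz, as in the music dataset

rows = []
for path in ['my_voice_01.wav', 'my_voice_02.wav']:  # your own recordings (hypothetical names)
    _, audio = wavfile.read(path)                    # expects mono 16 kHz wavs
    for i in range(len(audio) // CHUNK):
        rows.append(audio[i * CHUNK:(i + 1) * CHUNK])
np_arr = np.array(rows)

# Same hard-coded split as datasets/music/_2npy.py; needs len(np_arr) > 512.
np.save('datasets/music/music_train.npy', np_arr[:-2 * 256])
np.save('datasets/music/music_valid.npy', np_arr[-2 * 256:-256])
np.save('datasets/music/music_test.npy', np_arr[-256:])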

Funky GRU

GRUStep in lib/ops.py implements a somewhat untraditional GRU unit:
https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/lib/ops.py#L372

return (update * candidate) + ((one - update) * last_hidden)

which is supposed to return GRU as we know:
S_t = (1 - z).h + z.S_{t-1}

Instead the code does:
S_t = (1 - z).S_{t-1} + z.h

Logically this is not far from the original variant, except that z is inverted. But in practice, does it affect any of the figures presented in the paper?
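A toy check of that observation: feeding an inverted gate into the lib/ops.py form recovers the textbook form, so the two parameterizations are equally expressive; whether training dynamics differ is the open question:

import numpy as np

rng = np.random.RandomState(0)
z = rng.uniform(size=5)   # update gate
h = rng.randn(5)          # candidate state
s_prev = rng.randn(5)     # last hidden state

textbook = (1 - z) * h + z * s_prev  # S_t = (1 - z).h + z.S_{t-1}

def repo_form(update):
    # Shape of the return in lib/ops.py GRUStep:
    # (update * candidate) + ((1 - update) * last_hidden)
    return update * h + (1 - update) * s_prev

# Relabeling the gate z -> 1 - z makes the two rules coincide.
assert np.allclose(textbook, repo_form(1 - z))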
