samplernn-pytorch's Issues

No matching distribution found for torch==0.2.0.post3

I'm using the PyTorch Docker image, and I'm extremely confused about what I'm doing wrong at this point. Any assistance is appreciated.

root@6b27a1f07b65:~/samplernn-pytorch# pip install -r requirements.txt
Collecting librosa==0.5.1 (from -r requirements.txt (line 1))
Using cached librosa-0.5.1.tar.gz
Collecting matplotlib==2.1.0 (from -r requirements.txt (line 2))
Using cached matplotlib-2.1.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting natsort==5.1.0 (from -r requirements.txt (line 3))
Using cached natsort-5.1.0-py2.py3-none-any.whl
Collecting torch==0.2.0.post3 (from -r requirements.txt (line 4))
Could not find a version that satisfies the requirement torch==0.2.0.post3 (from -r requirements.txt (line 4)) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.2.0.post3 (from -r requirements.txt (line 4))

Edit: I can get it working without CUDA. At this point, is torch==0.2.0.post3 vital?

pause/resume

Hi, is there a way to pause/resume the training?
Thanks.

Talkin' 'bout my generation

Thanks for this code contribution!

Is there a way to just generate samples based on a given checkpoint without training?
The Generator is buried in the trainer code and teasing it out looks daunting.

Best,
- lonce
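
For anyone looking for the same thing: below is a rough, untested sketch of standalone generation. It assumes model.py exposes SampleRNN, Predictor and Generator, that the checkpoints under results/*/checkpoints are predictor state dicts, and that Generator is called as generator(n_seqs, seq_len); the constructor arguments, the path and the sample rate are likewise assumptions copied from the training defaults, so check train.py and trainer/plugins.py for the real signatures.

import torch
from librosa.output import write_wav

from model import SampleRNN, Predictor, Generator  # assumed locations
import utils

# Rebuild the model with the same hyperparameters used for training
# (these are the train.py defaults for --frame_sizes 16 4 --n_rnn 2).
model = SampleRNN(frame_sizes=[16, 4], n_rnn=2, dim=1024,
                  learn_h0=True, q_levels=256, weight_norm=True)
predictor = Predictor(model)

# Load a checkpoint (placeholder path; assumed to be a plain state dict).
state = torch.load('results/TEST/checkpoints/your-checkpoint.pth',
                   map_location=lambda storage, loc: storage)
predictor.load_state_dict(state)

# Generate one ~5 s clip at 16 kHz; Generator is assumed to return
# quantized (integer) samples of shape (n_seqs, seq_len).
generator = Generator(model, cuda=False)
samples = generator(1, 16000 * 5)

audio = utils.linear_dequantize(samples[0], model.q_levels).cpu().numpy()
write_wav('generated.wav', audio, sr=16000, norm=True)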

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect:

Hi!
I have been having a few issues running the code due to older dependencies; the closest I have gotten is running the Colab notebook referenced in the issue comments. The Colab notebook works, but when running the commands on Anaconda I run into this error: OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect. Does anyone know what is wrong?

Reconstructing specific wav file

Is it possible to reconstruct a specific wav file with this model, instead of generating random audio? If so, how? If not, can someone tell me how to do this or point me to a model that can?

No module named trainer

Hello,

When training, I get the following error. The trainer module doesn't seem to exist?


cd .. && python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset piano
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    from trainer.plugins import (
  File "/home/vincent/samplernn-pytorch/trainer/plugins.py", line 8, in <module>
    from torch.utils.trainer.plugins.plugin import Plugin
ModuleNotFoundError: No module named 'torch.utils.trainer'

Getting runtime error for sizes of tensors not matching

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 705664 and 58368 in dimension 1 at /pytorch/torch/lib/TH/generic/THTensorMath.c:2897

My batch size is 64 instead of the default 128; it would not run with 128, as I was getting a divide-by-zero error at that batch size. I do not know if this could be a factor, but thought it was worth mentioning.

Has anyone encountered this problem, what fixes are there?

What does n_frame_samples represent?

The terminology is a bit confusing. What does n_frame_samples mean? Is it the number of samples per frame, or the number of frames?

Is the RNN taking in a sequence of frames (i.e. one frame per timestep), with the dimension of each frame being n_frame_samples?

Which torch version to get?

Hi, some people have reported that putting torch==0.4.1 in the requirements works for them. For me this produces this error:

/content/samplernn-pytorch/model.py:60: UserWarning: nn.init.kaiming_uniform is now deprecated in favor of nn.init.kaiming_uniform_.
  init.kaiming_uniform(self.input_expand.weight)
/content/samplernn-pytorch/model.py:61: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.input_expand.bias, 0)
/content/samplernn-pytorch/nn.py:48: UserWarning: nn.init.uniform is now deprecated in favor of nn.init.uniform_.
  nn.init.uniform(tensor, -math.sqrt(3 / fan_in), math.sqrt(3 / fan_in))
/content/samplernn-pytorch/model.py:76: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(getattr(self.rnn, 'bias_ih_l{}'.format(i)), 0)
/content/samplernn-pytorch/nn.py:62: UserWarning: nn.init.orthogonal is now deprecated in favor of nn.init.orthogonal_.
  init(chunk)
/content/samplernn-pytorch/model.py:82: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(getattr(self.rnn, 'bias_hh_l{}'.format(i)), 0)
/content/samplernn-pytorch/nn.py:31: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.bias, 0)
/content/samplernn-pytorch/model.py:90: UserWarning: nn.init.uniform is now deprecated in favor of nn.init.uniform_.
  self.upsampling.conv_t.weight, -np.sqrt(6 / dim), np.sqrt(6 / dim)
/content/samplernn-pytorch/model.py:92: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.upsampling.bias, 0)
/content/samplernn-pytorch/model.py:141: UserWarning: nn.init.kaiming_uniform is now deprecated in favor of nn.init.kaiming_uniform_.
  init.kaiming_uniform(self.input.weight)
/content/samplernn-pytorch/model.py:150: UserWarning: nn.init.kaiming_uniform is now deprecated in favor of nn.init.kaiming_uniform_.
  init.kaiming_uniform(self.hidden.weight)
/content/samplernn-pytorch/model.py:151: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.hidden.bias, 0)
/content/samplernn-pytorch/model.py:161: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  init.constant(self.output.bias, 0)
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "/content/samplernn-pytorch/trainer/__init__.py", line 56, in run
    self.train()
  File "/content/samplernn-pytorch/trainer/__init__.py", line 61, in train
    enumerate(self.dataset, self.iterations + 1):
  File "/content/samplernn-pytorch/dataset.py", line 51, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 314, in <listcomp>
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/content/samplernn-pytorch/dataset.py", line 34, in __getitem__
    torch.from_numpy(seq), self.q_levels
RuntimeError: PyTorch was compiled without NumPy support

I'm using this Colab notebook btw: https://drive.google.com/file/d/13tVz73FXyG8Xvidl-SqyxNtBwozAlvth/view?usp=sharing

Any help would be appreciated!

How do you load models?

Hi, I have been playing around with this tool in Colab, but I just can't figure out how to save models. I started training, but I can't save the model I have trained and use it; even rerunning the cell doesn't do anything for me, it just starts over again. Is there any way to save models? Also, is there any way to sample from the model?

Training time?

How many epochs do you need to train before you get reasonable results in results/*/samples ?
How long does an epoch take in seconds / minutes on some AWS or laptop benchmark?

Could this be added to the README?

It would be useful if there were a way to check that things are set up correctly before doing a few days of training. I did that and still get noise, and I'm not sure if I need to wait longer or if I have something set up incorrectly.

Why is overlap length in dataloader needed?

Again, great project guys, good implementation!

I had some inline questions about the dataloader file; I don't understand why it does the segmentation the way it does.

# dataset.py    

# A) what is this??? 
# B) why? 
# C) Is this related to the number of samples per frame for tier 3?
self.overlap_len = 64     

# length of music clip   
n_samples = 128064    

# desired seq size we want for each training example
# D) why did you pick 1024? 
self.seq_len = 1024   

# iterate the full song 1024 units at a time    
for seq_begin in range(self.overlap_len, n_samples, self.seq_len):
    # 0 in first loop
    from_index = seq_begin - self.overlap_len  

    # 1088 in first loop.
    # E) Why not 1024? 
    # F) what is the overlap?
    to_index = seq_begin + self.seq_len   

    # (128 x 1088)  
    sequences = batch[:, from_index : to_index]   

    # G) why is this dropping off the last sample??
    input_sequences = sequences[:, : -1]   

    # H) why is the label such an odd subset? 
    target_sequences = sequences[:, self.overlap_len:].contiguous()   
    
    # I) Is X not trying to predict the next sequence making that missing chunk Y?
    # ie: full_seq = [1,2,3,4,5,6].   X = [1,2,3,4].   Y = [5, 6]?
    # currently this is not how the data are laid out.   
    yield (input_sequences, reset, target_sequences)    

Thanks! @koz4k
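
For reference, here is a worked version of the indexing above with the same values. My reading, which is an assumption rather than an answer from the authors, is that the extra overlap_len samples at the front of each window are conditioning context carried over from the previous window, which would explain why the targets start at overlap_len (questions E and F).

# Values taken from the snippet above.
overlap_len, seq_len, n_samples = 64, 1024, 128064

for i, seq_begin in enumerate(range(overlap_len, n_samples, seq_len)):
    from_index = seq_begin - overlap_len   # 0, 1024, 2048, ...
    to_index = seq_begin + seq_len         # 1088, 2112, 3136, ...
    # Every window is seq_len + overlap_len = 1088 samples wide, and
    # consecutive windows overlap by exactly overlap_len = 64 samples.
    print(from_index, to_index, to_index - from_index)
    if i == 2:
        break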

Division by Zero when training

  File "samplernn-pytorch/trainer/__init__.py", line 45, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/usr/local/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

This is with PyTorch 0.3.0.post4.

Generation is always the same

Hello,

I'm running train.py with different datasets and, although the loss decreases differently for each dataset, the generated samples are always exactly the same across different datasets. Is this a known issue?

sample generated is noise

Hi all,

I have been training the model using Google Colab.
However, the generated sample is noise all the time.
Please give me some suggestions ;)

Best regards,

Zixun

training stops in colab

I am using these parameters: python train.py --batch_size 4 --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset piano
and training stops in Colab with ^C automatically.

Bad amplitude normalization

Problem

Amplitudes are min-max normalized for each audio example loaded from the dataset.

This is bad for three reasons:

First reason: DC offset. The normalization is calculated by subtracting the minimum and dividing by the maximum. But if the minimum and maximum peaks have different magnitudes, silence is no longer the middle value, so you introduce a DC offset into the audio.

Second reason: Each example has different peaks, so each example will have a different quantization value for silence.

Third reason: dynamics. If part of my dataset is soft, part is loud, and part is transitions between soft and loud, they will all be normalized to loud. Now SampleRNN will struggle to learn those transitions. If some [8-second] example is nearly silent, now it is super loud.

I think the only acceptable amplitude normalization would be over the entire dataset, and you could do that [with ffmpeg] when creating the dataset.
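
To make the DC-offset point concrete, here is a tiny numeric sketch. The normalization is re-implemented inline from the description above, and the EPSILON value is a placeholder rather than the constant actually used in utils.py.

import torch

EPSILON = 1e-2  # placeholder value; see utils.py for the real constant

def quantize_with_minmax_norm(samples, q_levels):
    # per-example min/max normalization, as described above
    samples = samples.clone()
    samples -= samples.min()
    samples /= samples.max()
    samples *= q_levels - EPSILON
    samples += EPSILON / 2
    return samples.long()

def dequantize(samples, q_levels):
    # mirrors the repo's linear_dequantize
    return samples.float() / (q_levels / 2) - 1

x = torch.tensor([-0.5, 0.0, 0.0, 0.25])   # asymmetric peaks, mostly silence
q = quantize_with_minmax_norm(x, 256)
print(q)                   # silence maps to ~170 instead of 128
print(dequantize(q, 256))  # silence comes back as ~+0.33: a DC offset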

The normalization happens in linear_quantize.

The audio is normalized upon loading:

def __getitem__(self, index):
        (seq, _) = load(self.file_names[index], sr=None, mono=True)
        return torch.cat([
            torch.LongTensor(self.overlap_len) \
                 .fill_(utils.q_zero(self.q_levels)),
            utils.linear_quantize(
                torch.from_numpy(seq), self.q_levels
            )
        ])

(Example) linear_dequantize(linear_quantize(samples)) != samples

# quantize the wav amplitude into 256 levels
q_levels = 256
# Plot original wav samples
plot(samples)
# samples = tensor([ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000])


# Linearly quantize the samples
lq = linear_quantize(samples, q_levels)
plot(lq)
# lq = tensor([ 133,  133,  133,  ...,  133,  133,  133])
# note, silence should be 128


# Unquantize the samples
ldq = linear_dequantize(lq, q_levels)
plot(ldq)
# tensor([ 0.0391,  0.0391,  0.0391,  ...,  0.0391,  0.0391,  0.0391])
# introduction of DC offset. 
# instead, this should be silence: 0.0000, 0.0000, 0.0000, ...


Solution

Don't normalize with linear_quantize

def linear_quantize(samples, q_levels):
    # EPSILON is the constant defined alongside this function in utils.py;
    # samples are assumed to already lie in [-1, 1]
    samples = samples.clone()
    samples += 1
    samples /= 2
    samples *= q_levels - EPSILON
    samples += EPSILON / 2
    return samples.long()

README needs to have PyTorch version updated

The README still mentions PyTorch 0.1.12+, but since weight normalization was added to this code half a year ago, this is no longer true: torch.nn.utils.weight_norm wasn't available until PyTorch 0.2.

Sizes of tensors mismatch in dimension 0

Dear all, I generated a set of small chunks with ffmpeg myself, based on the given configuration (e.g. 8 s length). However, I get the following error after I start training:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 353344 and 352320 in dimension 1 at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/TH/generic/THTensorMath.c:3586

Does anyone have any ideas?

Thanks.

Convolution dim shuffling

In the FrameLevelRNN forward, you do:

    def forward(self, prev_samples, upper_tier_conditioning, hidden):
        (batch_size, _, _) = prev_samples.size()

        # (batch, seq_len, dim) -> (batch, dim, seq_len)
        input = prev_samples.permute(0, 2, 1)

        # (batch, dim, seq_len)
        # use conv1d instead of FC for speed
        input = self.input_expand(input)

        # (batch, dim, seq_len) -> (batch, seq_len, dim)
        input = input.permute(0, 2, 1)
        
        # add conditioning tier from previous frame 
        if upper_tier_conditioning is not None:
            input += upper_tier_conditioning
        
        # reset hidden state for TBPTT
        reset = hidden is None
        if hidden is None:
            (n_rnn, _) = self.h0.size()
            hidden = self.h0.unsqueeze(1) \
                            .expand(n_rnn, batch_size, self.dim) \
                            .contiguous()
        
        # run the frame-level RNN over the sequence
        (output, hidden) = self.rnn(input, hidden)
        
        # permute again so this can upsample for next context
        output = output.permute(0, 2, 1)
        output = self.upsampling(output)
        output = output.permute(0, 2, 1)
        return (output, hidden)
1. Are the comments I added correct?

2. I'd like to use a Linear layer instead of the Conv1d, just for understanding purposes. However, the dimensions don't line up when I do it that way. Any thoughts on how to reframe this in terms of a Linear layer?

3. I assume the transposes you do are so that the convolutions work out? Is that standard when using Conv1d instead of a Linear layer?
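
Regarding question 2: if input_expand is a Conv1d with kernel_size=1 (which is how it looks in model.py, though treat that as an assumption), it computes the same per-timestep linear map as an nn.Linear, and the permutes only exist to match Conv1d's (batch, channels, seq_len) layout. A small self-contained check:

import torch
import torch.nn as nn

batch, seq_len, n_frame_samples, dim = 2, 5, 64, 1024
prev_samples = torch.randn(batch, seq_len, n_frame_samples)

conv = nn.Conv1d(n_frame_samples, dim, kernel_size=1)
linear = nn.Linear(n_frame_samples, dim)
# copy the conv weights into the linear layer so the two outputs match
linear.weight.data.copy_(conv.weight.data.squeeze(-1))
linear.bias.data.copy_(conv.bias.data)

out_conv = conv(prev_samples.permute(0, 2, 1)).permute(0, 2, 1)  # permute in and out
out_linear = linear(prev_samples)                                # no permutes needed
print(torch.allclose(out_conv, out_linear, atol=1e-5))           # True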

MIDI file as input data

I have trained my model several times, and there is always noise in the generated sample audio.
Is it because the input audio is in wav format?
Otherwise, why not use MIDI files as input to reduce noise?

best regards,

ZX

Conflict between PyTorch version and CUDA version

Is it possible to run train.py with CUDA 9+?

train.py attempts to import torch.utils.trainer, which seems to have been removed from PyTorch at around version 1.0.0. However, I think 1.0.0 or newer is required for it to run on GPUs with CUDA 9+.

What's the easiest way to resolve this?

CUDNN_STATUS_NOT_INITIALIZED

I can get train.py to run if I don't use CUDA. I'm using the PyTorch Docker container with nvidia-docker.
nvcc -V reads "Cuda compilation tools, release 9.0, V9.0.176".
nvidia-smi says "Driver Version: 384.111".
In Python, print(torch.backends.cudnn.is_acceptable(torch.cuda.FloatTensor(1))) and print(torch.backends.cudnn.version()) read "yes" and "7101".
I'm using Python 3.6.4.

My command line is:
root@6b27a1f07b65:~/samplernn-pytorch# python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset mega

The process fails and produces:
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 184, in main
    model = model.cuda()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 230, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 152, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 152, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 152, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 100, in _apply
    self.flatten_parameters()
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 93, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CUDNN_STATUS_NOT_INITIALIZED

Any help would be appreciated.

How to use model for generation

Hi All,

I have completed training (about 30 hours); how can I generate audio now?

I saw only one file was generated in the results folder.

Thanks

SampleRNN as audio feature extractor

hi,
this is more a question than an issue -
I'm looking for a way to extract features from raw audio wav files and then use these features for different tasks such as voice recognition, voice activity detection, and the like, not for generative tasks.
I thought of somehow modifying a generative model like SampleRNN/WaveNet so it could be used only to encode the data into some feature space.
Can you please give some pointers on what modifications I need to make to the model to achieve that? Has anyone already done this before?
Any help would be greatly appreciated.

ZeroDivisionError: division by zero

The piano dataset is too big for my GPU, I think:

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCStorage.cu:58

So I created smallpiano with only ?.wav and ??.wav. However, when I run on it, I get the following:

~/samplernn-pytorch$ python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset smallpiano
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "/home/ubuntu/samplernn-pytorch/trainer/__init__.py", line 57, in run
    self.call_plugins('epoch', self.epochs)
  File "/home/ubuntu/samplernn-pytorch/trainer/__init__.py", line 44, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

How do I debug this?

quantizer

The quantizer doesn't work with q_levels=512. I'm not sure why, but it seems to me that it should. Maybe the epsilon is too small? For q_levels=512 you get quantized values of 512, not 511.

I find this variant easier to read, and it works:

    samples = (q_levels-1)*samples + 0.1
    samples = samples.long()

bug in hidden state?

Hi,

Thanks for the good work! I am wondering whether https://github.com/deepsound-project/samplernn-pytorch/blob/master/model.py#L49 should be:

h0 = torch.zeros(n_rnn, batch_size, dim)?

I noticed the expand function is used later, but it seems that with expand the values are shared across the batch. Is this a bug?

Many thanks!

Qiuqiang
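
Not an authoritative answer, but a small illustration of what expand() does in that spot: it creates a broadcast view in which every batch element shares the same learned h0 values, and the subsequent .contiguous() materializes that view into a real (n_rnn, batch_size, dim) tensor. Whether sharing the learned initial state across the batch is intended is something the authors would have to confirm.

import torch

n_rnn, batch_size, dim = 2, 3, 4
h0 = torch.zeros(n_rnn, dim)   # stand-in for the learned parameter

hidden = h0.unsqueeze(1).expand(n_rnn, batch_size, dim)
print(hidden.shape, hidden.is_contiguous())  # torch.Size([2, 3, 4]) False -- a view, values shared
hidden = hidden.contiguous()
print(hidden.is_contiguous())                # True -- materialized, no longer aliases h0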

Explanation of `frame_sizes` and `ns_frame_samples`

Hello,
can you please explain the purpose of frame_sizes and ns_frame_samples in the SampleRNN constructor?

I get the meaning of frame_sizes from the paper. However, there's something strange (at least to me): in the paper, especially in the main figure, it seems that the frame size at tier 3 is 16 and the frame size at tier 2 is 4. In the code, you use the same values (frame_sizes = [16, 4]); however, it seems that the order is reversed, because in Predictor's forward() you scan the RNNs in reversed order, so apparently you use 4 for tier 3 and 16 for tier 2. Is there something I'm not getting right here?

Besides, what's the purpose of n_frame_samples for each RNN?

Thanks!
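
For what it's worth, ns_frame_samples appears to be the cumulative product of frame_sizes (an assumption from reading the SampleRNN constructor, so please correct me if that's wrong), i.e. how many raw waveform samples each frame-level tier covers per step:

import numpy as np

frame_sizes = [16, 4]
ns_frame_samples = list(map(int, np.cumprod(frame_sizes)))
print(ns_frame_samples)  # [16, 64]
# Under this reading, the lower frame-level RNN conditions on 16-sample
# frames and the upper one on 64-sample frames (16 * 4), so each tier up
# the hierarchy sees a coarser time scale.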

Successful implementations?

Hi,

Is there anyone who successfully generated audio using this implementation of SampleRNN? If so, what parameter set did you use? What was your final training and validation loss? Furthermore, did you adjust any code in order to produce audio of good quality?

I ran several instances of this implementation to learn patterns in classical music, using different parameters. The lowest training and validation losses I obtained were 0.72 and 0.86 (NLL in bits), respectively. The corresponding parameter set was equivalent to the optimal music parameter set according to Mehri et al. However, the generated audio files are still noisy. I know there are people who generated good-quality audio using the original Theano implementation, although their losses were higher. I am therefore wondering if there are people who have or had similar issues. Your help is appreciated :).

Kind regards,

Stafla

YouTube-Mix dataset possibly leaks training data to validation and testing

Hi, I became aware of the YouTube-Mix dataset, which is proposed in this work, via papers that build on it.

In these works, they use 88% of the files for training and 6% each for validation and testing. At least for the current version of the referenced YouTube video, https://www.youtube.com/watch?v=EhO_MrRfftU, the video consists of about 45 minutes of pieces repeated 6 times and cut at 4 hours. The 88/6/6% split therefore yields validation and test sets that are completely contained in the training dataset.

Since this repo is referenced by their works, it might be valuable for future researchers to be made aware of this issue with the YouTube-Mix dataset proposed in this repo.
