Comments (9)
Thanks for sharing. Good to know! :)
People have tried it on different datasets, including Korean, classical music, and ambient. @richardassar even got interesting results from training on a couple of hours of Tangerine Dream works. See: https://soundcloud.com/psylent-v/tracks
Do you know whether the generated sound (of the Tangerine Dream model) is purely from the network, or mixed with some supporting passages (such as drums) added by a human?
@LinkOne1A These guitar samples are really nice
Credit to the network and the folks behind the SampleRNN paper!
I'm surprised that training on multi-instrument data worked so well, and I'm puzzled as to why. My intuition was that multiple instruments playing at the same time would limit the space of valid ("pleasing" may be a better word) output combinations; I'm not sure how I would go about proving (or disproving) this.
What has been your experience in this area?
For Tangerine Dream the validation loss (in bits) was close to 3.2 after 300k iterations, vs below 1.0 for solo piano. Note that I used the three_tier.py model in both cases.
It seems weight normalisation, both in the linear layers and in the transformations inside the GRUs, helps with generalisation. I'm conducting some experiments to verify this for myself.
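For illustration, here's a minimal sketch of weight normalisation applied to a linear layer and to a GRU's input/hidden transformations. It assumes PyTorch (the repo itself is Theano-based), and the layer sizes are made up:

```python
# Weight normalisation reparameterises a weight tensor w as g * v / ||v||,
# decoupling the magnitude of the weights from their direction.
# Illustrative PyTorch sketch only; not the repo's Theano code.
import torch.nn as nn
from torch.nn.utils import weight_norm

linear = weight_norm(nn.Linear(1024, 1024))  # applied to the 'weight' parameter

gru = nn.GRU(input_size=1024, hidden_size=1024)
# Apply weight norm to the GRU's input and hidden transformations as well:
for name in ["weight_ih_l0", "weight_hh_l0"]:
    gru = weight_norm(gru, name=name)
```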
Some of SampleRNN's capacity to generalise might be due to the quantization noise introduced in going to 8 bits. It may be interesting to try something like https://arxiv.org/pdf/1701.06548.pdf to further improve generalisation; however, I've yet to observe an increase in validation loss, so we're probably slightly underparameterised on these datasets.
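For reference, the technique in that paper (penalising confident output distributions) amounts to subtracting a scaled entropy term from the cross-entropy loss. A minimal sketch, assuming PyTorch; the function name and beta value are illustrative, not from the paper or the repo:

```python
import torch.nn.functional as F

def loss_with_confidence_penalty(logits, targets, beta=0.1):
    # Standard cross-entropy over the 256 quantization levels.
    ce = F.cross_entropy(logits, targets)
    # Entropy of the predicted distribution; subtracting it discourages
    # the model from putting all its mass on a single output level.
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return ce - beta * entropy
```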
Ignoring independent_preds, the three_tier model has around 18 million parameters, which is evidently sufficient to capture the temporal dynamics of the signal to an acceptable degree. If you think about it from an information-theoretic point of view (Kolmogorov complexity / minimum description length), there's a lot of redundancy in the signal that can be "compressed" away by the network.
The model seems to capture various synthesiser sounds, crowd noise from the live recordings, and both synthetic and real drums, including correct percussive patterns; however, it could not maintain a consistent tempo. This could be helped with conditioning on an auxiliary time series.
If you used preprocess.py in datasets/music/ then you may want to run your experiments again. See: #14
Thanks for the details! Interesting and surprising that a validation loss of 3.2 produces the Tangerine Dream segment.
What was the length of the original (total) audio?
How long did it take to get to 300k steps, and what kind of GPU do you have?
The quantization noise (going to 8 bits), are you referring to the mu-law encode/decode? I ran a standalone test of a mu-law-processed wav file vs the original wav and could not hear the difference; inverting one and summing the two sources in Audacity showed very little amplitude, mostly, I think, at the high end of the spectrum.
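For anyone who wants to reproduce that check, here is a minimal numpy sketch of the 8-bit mu-law round trip and its residual; the function names are mine, not from the repo:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # x in [-1, 1] -> integer levels 0..mu
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=255):
    # Inverse companding: levels 0..mu -> [-1, 1]
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.random.uniform(-1, 1, 16000)  # stand-in for one second of audio
residual = x - mu_law_decode(mu_law_encode(x))
print("residual RMS:", np.sqrt(np.mean(residual ** 2)))
```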
I hadn't thought about the descriptive complexity of the source signal. Now that you mention it, I'd say that if a network had to deal with already-compressed data and had to figure out a predictive solution for the next likely outcome in a series... I don't know, but my hunch is that it would be more difficult for the network, which means we would need more complexity (layers) in the network. This is my off-the-cuff thought!
I'll check it out ( #14 ).
I have not yet looked into the 3 tier training.
Total audio was about 32 hours, although due to the bug I didn't end up training on all of it!
It seems that the required loss for acceptable samples is really relative to the dataset: multiple instruments increase the entropy of the signal, and unlike with piano the model seems to get "lost" far less frequently because it has a more varied space in which to recover. Before fully converging, the piano samples sometimes go unstable; this effect was almost non-existent when training on Tangerine Dream.
No, I'm referring to quantization noise as x - Q(x) for any quantization scheme. Introducing noise of any kind acts as a regulariser, e.g. https://en.wikipedia.org/wiki/Tikhonov_regularization
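Concretely, a small numpy sketch of x - Q(x) for a plain 8-bit uniform quantizer (illustrative, not the repo's code path):

```python
import numpy as np

def quantize_uniform(x, bits=8):
    levels = 2 ** bits - 1
    q = np.round((x + 1) / 2 * levels)  # map [-1, 1] to 0..255
    return 2 * q / levels - 1           # back to [-1, 1]

x = np.random.uniform(-1, 1, 16000)
noise = x - quantize_uniform(x)         # the x - Q(x) term
print("quantization noise RMS:", np.sqrt(np.mean(noise ** 2)))
```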
It's true that compressed data has more entropy, but if you decompress the signal again (assuming lossy compression) the resulting entropy is lower than the original signal's, and it should be easier to model. I was referring to the compressibility of the signal; it seems there's plenty of scope for that.
Something that would be interesting to try, akin to the speaker conditioning mentioned in the Char2Wav paper, is conditioning on instrument or genre with an embedded one-hot signal. This might allow interpolation between styles etc. This is an area of research I'll be looking into over the next while.
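A hypothetical sketch of that conditioning idea, assuming PyTorch; all names and sizes here are made up for illustration:

```python
import torch
import torch.nn as nn

class ConditionedTier(nn.Module):
    """Recurrent tier that sees an embedded genre/instrument id at every step."""
    def __init__(self, n_genres=16, embed_dim=32, frame_dim=1024, hidden=1024):
        super().__init__()
        self.genre_embed = nn.Embedding(n_genres, embed_dim)
        self.gru = nn.GRU(frame_dim + embed_dim, hidden, batch_first=True)

    def forward(self, frames, genre_id):
        # frames: (batch, time, frame_dim); genre_id: (batch,) long tensor
        g = self.genre_embed(genre_id)                     # (batch, embed_dim)
        g = g.unsqueeze(1).expand(-1, frames.size(1), -1)  # broadcast over time
        out, _ = self.gru(torch.cat([frames, g], dim=-1))
        return out
```

Interpolating between two genre embeddings at sampling time would then be one way to morph between styles.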
It took a couple of days to get to 300k steps; I'm training on a GTX 1080. The machine I have has two, but the script does not split minibatches over both GPUs. I have implemented SampleRNN myself and it can train 4x faster (without GRU weight norm), soon to be released.
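For what it's worth, in a PyTorch setting splitting minibatches over both GPUs is nearly a one-liner; a sketch, where the GRU stands in for whatever module implements the network:

```python
import torch
import torch.nn as nn

model = nn.GRU(256, 1024, batch_first=True)  # stand-in for the real model
if torch.cuda.device_count() > 1:
    # Splits each minibatch along the batch dimension across the GPUs.
    model = nn.DataParallel(model).cuda()
```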
Although I have avoided it so far, it's probably worth filtering out low-amplitude audio segments from the training set. These get amplified during normalization, which pulls up the noise floor and introduces lots of high-energy noise that can only disrupt or slow down training.
A plot of the RMS power of each segment in the piano database shows the distribution; you can see the tail of low-energy signals on the right, which could probably be pruned (especially the one segment with zero energy).
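A possible pruning pass along those lines, in numpy; the threshold is arbitrary and would need tuning against that RMS distribution:

```python
import numpy as np

def rms(segment):
    # Root-mean-square power of a 1-D segment scaled to [-1, 1].
    return np.sqrt(np.mean(segment.astype(np.float64) ** 2))

def prune_quiet(segments, threshold=1e-3):
    # Drop the quietest segments before amplitude normalization.
    return [s for s in segments if rms(s) > threshold]
```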