Deep Learning with Audio

DOM-E5129 - Intelligent Computational Media

This repository contains the deep learning with audio examples and course materials for DOM-E5129 - Intelligent Computational Media:

  • Documentation on different deep learning audio systems, and instructions for using some of them
  • Tools for loading, playing and plotting audio
  • Some working simple classifiers
  • Non-working sample-level/raw audio GANs
  • Python scripts for sorting different popular datasets

State of audio generation in Deep Learning (December 2018)

Speech and music (MIDI) generation are doing well; however, the methods that work well with images don't translate that well to the audio domain. Converting sounds into spectrograms and other signal-processing representations makes it possible to use image models, but the results tend to be underwhelming and the sound quality is poor.
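
To make the spectrogram route concrete, here is a minimal sketch (assuming the librosa and matplotlib packages; "example.wav" is a hypothetical placeholder path) that turns a waveform into the kind of log-magnitude spectrogram an image model could consume:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the audio as a 1-D float array (placeholder path).
y, sr = librosa.load("example.wav", sr=22050)

# Short-time Fourier transform, then magnitude in decibels.
stft = librosa.stft(y, n_fft=1024, hop_length=256)
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# The 2-D array can now be fed to image-style models; note that the phase
# has been thrown away, which is a big part of why resynthesis sounds bad.
librosa.display.specshow(spec_db, sr=sr, hop_length=256,
                         x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-frequency spectrogram")
plt.show()
```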

A blog post going deeper into why this is the case.

WaveNet (September 2016) was a massive breakthrough in audio generation. It creates waveforms sample by sample, which seems to be the reason it produces so much better results. It's a convolutional neural network, an architecture that wasn't commonly used for generation before. It is mainly used to create natural speech, but there have been some experiments with music generation too. This is one of the applications that has seen widespread real-world use.

Two Minute Papers video about WaveNet

Follow-up paper that makes generation a lot faster (November 2017)

WaveNet is also part of Google Duplex, the restaurant reservation assistant (May 2018)
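
A minimal, untrained sketch of the dilated causal convolution stack at the core of WaveNet-style models (assuming Keras; layer counts and widths are arbitrary, and the gated activations and skip connections of the real model are omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_wavenet_stack(sequence_length=16000, channels=32, num_layers=8):
    inputs = layers.Input(shape=(sequence_length, 1))
    x = layers.Conv1D(channels, kernel_size=2, padding="causal")(inputs)
    # Doubling the dilation rate each layer grows the receptive field
    # exponentially while staying causal (no peeking at future samples).
    for i in range(num_layers):
        x = layers.Conv1D(channels, kernel_size=2, padding="causal",
                          dilation_rate=2 ** i, activation="relu")(x)
    # Per-timestep softmax over 256 quantized amplitude values,
    # as in the original mu-law formulation.
    outputs = layers.Conv1D(256, kernel_size=1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = tiny_wavenet_stack()
model.summary()
```

Generation then proceeds one sample at a time, feeding each predicted sample back in as input, which is exactly why sampling from WaveNet is slow and why the follow-up paper on faster generation mattered.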

GANs are a good example of how much more slowly the audio domain is progressing compared to computer vision and image generation: the original GAN paper came out in 2014 and there have been multiple amazing applications of it in recent years, yet it took until 2018 before anyone managed to combine the WaveNet sample-generation approach with a GAN.

Failed attempt from January 2017

Successful version from January 2018

One of the most promising works is “A Universal Music Translation Network” (May 2018) by Facebook Research. It can take a piece of music played one way and translate it to another style: piano -> harpsichord, band -> orchestra, whistling -> orchestra. It uses a clever system of encoding the input into a shared musical “language” that it can then translate to different styles or instruments with separately trained models. Unfortunately, the code for the project is not available, and the model was trained for 6 days on 8 GPUs.
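
The shared-“language” idea can be sketched schematically. This is not Facebook's code (which isn't available); plain Keras convolution layers stand in for their WaveNet autoencoders, and all names and sizes below are made up:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_encoder(length=16000, latent_channels=64):
    audio = layers.Input(shape=(length, 1))
    z = layers.Conv1D(latent_channels, 16, strides=4, padding="same",
                      activation="relu")(audio)
    z = layers.Conv1D(latent_channels, 16, strides=4, padding="same",
                      activation="relu")(z)   # the shared "musical language"
    return tf.keras.Model(audio, z, name="shared_encoder")

def make_decoder(style, latent_channels=64):
    z = layers.Input(shape=(None, latent_channels))
    y = layers.UpSampling1D(4)(z)
    y = layers.Conv1D(latent_channels, 16, padding="same", activation="relu")(y)
    y = layers.UpSampling1D(4)(y)
    y = layers.Conv1D(1, 16, padding="same", activation="tanh")(y)  # waveform in [-1, 1]
    return tf.keras.Model(z, y, name=f"decoder_{style}")

encoder = make_encoder()
decoders = {style: make_decoder(style) for style in ["piano", "orchestra"]}

# "Translation": encode any input once, then decode with whichever
# separately trained decoder matches the target style or instrument.
```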

One huge problem with all of these systems is that the published results are very idealised: when only the best results are picked, it gives a misleading picture of what is actually possible. A good early example is GRUV, all the way from 2015. It seems like it can generate music, but it actually just memorizes it (down to the lyrics). A more likely scenario in the current situation is presented in this video (three full days of training, with just some plausible stuttering backing vocals to show for it).

With massive datasets, it is very likely that your impressive results are just clever sampling from the dataset.

The only reasonable and accessible system seems to be Magenta. It has a great set of trained models for different types of musical improvisation. It is also designed to work in the browser as fun, easily accessible toys. The problem is that it's mainly MIDI-based, which massively limits the possibilities. Magenta also includes NSynth, a system that can combine instruments in fascinating ways, and you can actually use it as an instrument (March 2018).
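
To see why “MIDI-based” is limiting, here is a minimal sketch (assuming the pretty_midi package, chosen purely for illustration rather than anything Magenta-specific): a MIDI note is just pitch, velocity and timing, with no timbre or audio detail at all.

```python
import pretty_midi

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)   # program 0 = Acoustic Grand Piano

# A C major arpeggio: every note is only four numbers.
for i, pitch in enumerate([60, 64, 67, 72]):
    piano.notes.append(pretty_midi.Note(velocity=100, pitch=pitch,
                                        start=0.5 * i, end=0.5 * (i + 1)))

pm.instruments.append(piano)
pm.write("arpeggio.mid")    # hypothetical output path
```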

Almost all of the applications listed here require intense amounts of training. Most of the big papers train with 10-32 GPUs for around a week.

So any attempted practical application of these systems is likely to be unsuccessful at the current time.

Promising or interesting works

Strange and interesting offshoot work

Datasets

Datasets are also one huge problem currently. There aren't many large, high-quality audio datasets, and for non-music, non-speech sounds the situation feels pretty dead. (A minimal sketch for loading and inspecting a downloaded dataset follows the list below.)

  • Google AudioSet

    • It is really big and categorized, but the problem is that it's just 10-second clips of YouTube videos, with the type of sound somewhere in there, and one clip might even contain multiple types of sound. Good for classification, terrible for generation. There are also some legal problems with extracting just the audio from these videos.
    • The VEGAS dataset is a human-curated subset of AudioSet that is less noisy and generally better for sound generation tasks.
  • ESC-50

    • A dataset of 50 different categories of environmental sounds. Its main use is benchmarking classification, but it's one of the only sources of quality environmental sounds currently. The problem is that it's very small, 40 sounds per category, which makes it tricky to use for generation.
  • The NSynth Dataset

    • An absolutely massive set of about 300 000 sound files: basically single notes played on different instruments. It's done with MIDI instruments, so not the most interesting in that sense, but it's easily big enough for generation too.
  • Speech Commands Dataset and SC Zero to Nine Speech Commands

    • There are multiple datasets for speech commands, and they tend to be large and high-quality. Human speech is just not the most interesting thing to generate, but it will likely be the baseline for any future systems.
  • Kaggle audio datasets

    • There are some strange things here and more must be coming, but the quality varies wildly.
  • There are also many sources of sound effects, for example, but considering the amount you need, collecting them from different sources would be a major undertaking. One fun one is the BBC sound effects archive.
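
As mentioned above, here is a minimal sketch for taking stock of a downloaded dataset (assuming librosa and a local folder of .wav files; "ESC-50/audio" is a placeholder path):

```python
from pathlib import Path
import librosa

durations = []
for path in sorted(Path("ESC-50/audio").glob("*.wav")):   # placeholder path
    y, sr = librosa.load(path, sr=None)                   # keep the native sample rate
    durations.append(len(y) / sr)

print(f"{len(durations)} clips, "
      f"{sum(durations) / 60:.1f} minutes of audio in total, "
      f"average clip length {sum(durations) / len(durations):.2f} s")
```

For generation in particular, the total amount of audio is usually the first number worth checking, which is exactly where a small set like ESC-50 falls short.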

Other notes

  • The raw audio sample approach is so unexplored that many frameworks don't even have a Conv1DTranspose implementation, so people make their own by running the data through Conv2DTranspose (see the sketch after this list).
  • The only audio tutorial for TensorFlow is based on spectrograms and only does speech recognition.
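
A minimal sketch of that workaround, assuming Keras: add a dummy spatial axis, apply Conv2DTranspose, then squeeze the axis back out.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv1d_transpose(x, filters, kernel_size, strides):
    # (batch, time, channels) -> (batch, time, 1, channels)
    x = layers.Lambda(lambda t: tf.expand_dims(t, axis=2))(x)
    x = layers.Conv2DTranspose(filters, kernel_size=(kernel_size, 1),
                               strides=(strides, 1), padding="same")(x)
    # (batch, time * strides, 1, filters) -> (batch, time * strides, filters)
    return layers.Lambda(lambda t: tf.squeeze(t, axis=2))(x)

inputs = layers.Input(shape=(4000, 64))
upsampled = conv1d_transpose(inputs, filters=32, kernel_size=16, strides=4)
model = tf.keras.Model(inputs, upsampled)
model.summary()   # output shape should be (None, 16000, 32)
```

Newer versions of Keras have since added a native Conv1DTranspose layer, but this is the pattern that 2018-era raw-audio code typically uses.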

Other interesting links

  • Creative.ai
    • An organization dedicated to creating interesting creative applications of AI in as many different fields as possible.
  • Keras-GAN on GitHub
    • Repository of most of the biggest image GANs, implemented in Keras.
  • SeedBank
    • A collection of interactive machine learning examples running on Google Colab (with free GPUs)
