Deep Learning with Audio

DOM-E5129 - Intelligent Computational Media

This repository contains the deep learning with audio examples and course materials for DOM-E5129 - Intelligent Computational Media:

  • Documentation on different deep learning audio systems, and instructions for using some of them
  • Tools for loading, playing and plotting audio
  • Some working simple classifiers
  • Non-working sample-level/raw audio GANs
  • Python scripts for sorting different popular datasets

State of audio generation in Deep Learning (December 2018)

Speech and music (MIDI) generation are doing well; however, the methods that work well with images don't translate that well to the audio domain. Converting sounds into spectrograms and other signal-processing representations makes it possible to use image models, but the results tend to be underwhelming and the sound quality is poor.
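
To make the spectrogram route concrete, here is a minimal sketch (assuming the librosa and matplotlib packages; "example.wav" is a hypothetical placeholder path) that turns a waveform into the kind of log-magnitude spectrogram an image model could consume:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load the audio as a 1-D float array (placeholder path).
y, sr = librosa.load("example.wav", sr=22050)

# Short-time Fourier transform, then magnitude in decibels.
stft = librosa.stft(y, n_fft=1024, hop_length=256)
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# The 2-D array can now be fed to image-style models; note that the phase
# has been thrown away, which is a big part of why resynthesis sounds bad.
librosa.display.specshow(spec_db, sr=sr, hop_length=256,
                         x_axis="time", y_axis="log")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-frequency spectrogram")
plt.show()
```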

A blog post going deeper into why this is the case.

WaveNet (September 2016) was a massive breakthrough in audio generation. It creates waveforms sample by sample, which seems to be the reason it produces so much better results. It's a convolutional neural network, an architecture that wasn't commonly used for generation before. It is mainly used to create natural speech, but there have been some experiments with music generation too. This is one of the applications that has seen widespread real-world use.

Two Minute Papers video about WaveNet

Follow-up paper that makes generation a lot faster (November 2017)

WaveNet is also part of Google Duplex, the restaurant reservation assistant (May 2018)
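
A minimal, untrained sketch of the dilated causal convolution stack at the core of WaveNet-style models (assuming Keras; layer counts and widths are arbitrary, and the gated activations and skip connections of the real model are omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_wavenet_stack(sequence_length=16000, channels=32, num_layers=8):
    inputs = layers.Input(shape=(sequence_length, 1))
    x = layers.Conv1D(channels, kernel_size=2, padding="causal")(inputs)
    # Doubling the dilation rate each layer grows the receptive field
    # exponentially while staying causal (no peeking at future samples).
    for i in range(num_layers):
        x = layers.Conv1D(channels, kernel_size=2, padding="causal",
                          dilation_rate=2 ** i, activation="relu")(x)
    # Per-timestep softmax over 256 quantized amplitude values,
    # as in the original mu-law formulation.
    outputs = layers.Conv1D(256, kernel_size=1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = tiny_wavenet_stack()
model.summary()
```

Generation then proceeds one sample at a time, feeding each predicted sample back in as input, which is exactly why sampling from WaveNet is slow and why the follow-up paper on faster generation mattered.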

GANs are a good example of how much more slowly the audio domain is progressing compared to computer vision and image generation: the original GAN paper came out in 2014 and there have been multiple amazing applications of it in recent years, yet it took until 2018 before anyone managed to combine the WaveNet sample-generation approach with a GAN.

Failed attempt from January 2017

Successful version from January 2018

One of the most promising works is “A Universal Music Translation Network” (May 2018) by Facebook Research. It can take a piece of music played one way and translate it to another style: piano -> harpsichord, band -> orchestra, whistling -> orchestra. It uses a clever system of encoding the input into a shared musical “language” that it can then translate to different styles or instruments with separately trained models. Unfortunately, the code for the project is not available, and the model was trained for 6 days on 8 GPUs.
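
The shared-“language” idea can be sketched schematically. This is not Facebook's code (which isn't available); plain Keras convolution layers stand in for their WaveNet autoencoders, and all names and sizes below are made up:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_encoder(length=16000, latent_channels=64):
    audio = layers.Input(shape=(length, 1))
    z = layers.Conv1D(latent_channels, 16, strides=4, padding="same",
                      activation="relu")(audio)
    z = layers.Conv1D(latent_channels, 16, strides=4, padding="same",
                      activation="relu")(z)   # the shared "musical language"
    return tf.keras.Model(audio, z, name="shared_encoder")

def make_decoder(style, latent_channels=64):
    z = layers.Input(shape=(None, latent_channels))
    y = layers.UpSampling1D(4)(z)
    y = layers.Conv1D(latent_channels, 16, padding="same", activation="relu")(y)
    y = layers.UpSampling1D(4)(y)
    y = layers.Conv1D(1, 16, padding="same", activation="tanh")(y)  # waveform in [-1, 1]
    return tf.keras.Model(z, y, name=f"decoder_{style}")

encoder = make_encoder()
decoders = {style: make_decoder(style) for style in ["piano", "orchestra"]}

# "Translation": encode any input once, then decode with whichever
# separately trained decoder matches the target style or instrument.
```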

One huge problem with all of these systems is that the published results are very idealised: when only the best results are picked, it gives a misleading picture of what is actually possible. A good early example is GRUV, all the way from 2015. It seems like it can generate music, but it actually just memorizes it (down to the lyrics). A more likely scenario in the current situation is presented in this video (three full days of training, with just some plausible stuttering backing vocals to show for it).

With massive datasets, it is very likely that your impressive results are just clever sampling from the dataset.

The only reasonable and accessible system seems to be Magenta. It has a great set of trained models for different types of musical improvisation. It is also designed to work in the browser as fun, easily accessible toys. The problem is that it's mainly MIDI-based, which massively limits the possibilities. Magenta also includes NSynth, a system that can combine instruments in fascinating ways, and you can actually use it as an instrument (March 2018).
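
To see why “MIDI-based” is limiting, here is a minimal sketch (assuming the pretty_midi package, chosen purely for illustration rather than anything Magenta-specific): a MIDI note is just pitch, velocity and timing, with no timbre or audio detail at all.

```python
import pretty_midi

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)   # program 0 = Acoustic Grand Piano

# A C major arpeggio: every note is only four numbers.
for i, pitch in enumerate([60, 64, 67, 72]):
    piano.notes.append(pretty_midi.Note(velocity=100, pitch=pitch,
                                        start=0.5 * i, end=0.5 * (i + 1)))

pm.instruments.append(piano)
pm.write("arpeggio.mid")    # hypothetical output path
```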

Almost all of the applications listed here require intense amounts of training. Most of the big papers train with 10-32 GPUs for around a week.

So any attempted practical application of these systems is likely to be unsuccessful at the current time.

Promising or interesting works

Strange and interesting offshoot work

Datasets

Datasets are also one huge problem currently. There aren't many large, high-quality audio datasets, and for non-music, non-speech sounds the situation feels pretty dead. (A minimal sketch for loading and inspecting a downloaded dataset follows the list below.)

  • Google AudioSet

    • It is really big and categorized, but the problem is that it's just 10-second clips of YouTube videos, with the type of sound somewhere in there, and one clip might even contain multiple types of sound. Good for classification, terrible for generation. There are also some legal problems with extracting just the audio from these videos.
    • The VEGAS dataset is a human-curated subset of AudioSet that is less noisy and generally better for sound generation tasks.
  • ESC-50

    • A dataset of 50 different categories of environmental sounds. Its main use is benchmarking classification, but it's one of the only sources of quality environmental sounds currently. The problem is that it's very small, 40 sounds per category, which makes it tricky to use for generation.
  • The NSynth Dataset

    • An absolutely massive set of about 300 000 sound files: basically single notes played on different instruments. It's done with MIDI instruments, so not the most interesting in that sense, but it's easily big enough for generation too.
  • Speech Commands Dataset and SC Zero to Nine Speech Commands

    • There are multiple datasets for speech commands, and they tend to be large and high-quality. Human speech is just not the most interesting thing to generate, but it will likely be the baseline for any future systems.
  • Kaggle audio datasets

    • There are some strange things here and more must be coming, but the quality varies wildly.
  • There are also many sources of sound effects, for example, but considering the amount you need, collecting them from different sources would be a major undertaking. One fun one is the BBC sound effects archive.
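
As mentioned above, here is a minimal sketch for taking stock of a downloaded dataset (assuming librosa and a local folder of .wav files; "ESC-50/audio" is a placeholder path):

```python
from pathlib import Path
import librosa

durations = []
for path in sorted(Path("ESC-50/audio").glob("*.wav")):   # placeholder path
    y, sr = librosa.load(path, sr=None)                   # keep the native sample rate
    durations.append(len(y) / sr)

print(f"{len(durations)} clips, "
      f"{sum(durations) / 60:.1f} minutes of audio in total, "
      f"average clip length {sum(durations) / len(durations):.2f} s")
```

For generation in particular, the total amount of audio is usually the first number worth checking, which is exactly where a small set like ESC-50 falls short.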

Other notes

  • The raw audio sample approach is so unexplored that many frameworks don't even have a Conv1DTranspose implementation, so people make their own by running the data through Conv2DTranspose (see the sketch after this list).
  • The only audio tutorial for TensorFlow is based on spectrograms and only does speech recognition.
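
A minimal sketch of that workaround, assuming Keras: add a dummy spatial axis, apply Conv2DTranspose, then squeeze the axis back out.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv1d_transpose(x, filters, kernel_size, strides):
    # (batch, time, channels) -> (batch, time, 1, channels)
    x = layers.Lambda(lambda t: tf.expand_dims(t, axis=2))(x)
    x = layers.Conv2DTranspose(filters, kernel_size=(kernel_size, 1),
                               strides=(strides, 1), padding="same")(x)
    # (batch, time * strides, 1, filters) -> (batch, time * strides, filters)
    return layers.Lambda(lambda t: tf.squeeze(t, axis=2))(x)

inputs = layers.Input(shape=(4000, 64))
upsampled = conv1d_transpose(inputs, filters=32, kernel_size=16, strides=4)
model = tf.keras.Model(inputs, upsampled)
model.summary()   # output shape should be (None, 16000, 32)
```

Newer versions of Keras have since added a native Conv1DTranspose layer, but this is the pattern that 2018-era raw-audio code typically uses.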

Other interesting links

  • Creative.ai
    • An organization dedicated to creating interesting creative applications of AI in as many different fields as possible.
  • Keras-GAN on GitHub
    • Repository of most of the biggest image GANs, implemented in Keras.
  • SeedBank
    • A collection of interactive machine learning examples running on Google Colab (with free GPUs)
