zcaceres / spec_augment


🔦 A Pytorch implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Home Page: https://arxiv.org/abs/1904.08779

License: MIT License

Jupyter Notebook 99.26% Shell 0.01% Python 0.74%

spec_augment's Introduction

SpecAugment with Pytorch

A Pytorch Implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Medium Article

SpecAugment is a state-of-the-art data augmentation approach for speech recognition.

The paper's authors did not publish code that I could find, and their implementation was in TensorFlow. We implemented all three SpecAugment transforms using Pytorch, torchaudio, and fastai / fastai-audio.

To use:

  1. Run install.sh (I recommend using a unique conda env for the project).

     After the install script runs, you should have a torchaudio folder in your project folder.

  2. Check out SpecAugment.ipynb (a Jupyter notebook) for the functions.

Augmentations

Time Warp [image: time warp aug]

Time Mask [image: time mask aug]

Frequency Mask [image: freq mask aug]

Combined [image: combined augs]
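The two masking transforms can be illustrated without any framework. The following is a minimal sketch in plain Python, not the notebook's implementation: `mask_band`, its parameters, and the list-of-rows spectrogram layout are all assumptions made for illustration. It picks a random band width up to `max_width` and a random start position, then overwrites that band of frequency bins (rows) or time frames (columns) with a constant.

```python
import random


def mask_band(spec, max_width, axis, rng=random, value=0.0):
    """Return a masked copy of `spec` plus the chosen (start, width).

    `spec` is a list of rows (frequency bins), each a list of time frames.
    axis=0 masks consecutive frequency bins; axis=1 masks consecutive frames.
    This helper is hypothetical and only illustrates the idea.
    """
    num_rows, num_cols = len(spec), len(spec[0])
    size = num_rows if axis == 0 else num_cols
    width = rng.randint(0, min(max_width, size))  # band width, may be 0
    start = rng.randint(0, size - width)          # keeps the band inside the spectrogram
    band = range(start, start + width)
    masked = [
        [value if (r if axis == 0 else c) in band else cell
         for c, cell in enumerate(row)]
        for r, row in enumerate(spec)
    ]
    return masked, (start, width)


# Toy spectrogram: 4 frequency bins x 8 time frames, no zero entries.
spec = [[float(r * 10 + c) + 1 for c in range(8)] for r in range(4)]
rng = random.Random(0)
freq_masked, (f0, f) = mask_band(spec, max_width=2, axis=0, rng=rng)
both_masked, (t0, t) = mask_band(freq_masked, max_width=3, axis=1, rng=rng)
```

Applying the frequency mask first and the time mask second mirrors the "Combined" picture in spirit; the order does not matter for masking, since both simply overwrite cells.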

Note on Time Warp

The Time Warp augmentation relies on TensorFlow-specific functionality that Pytorch does not support. We implemented the supporting functions for this augmentation in SparseImageWarp.ipynb. You do not need to look at that notebook to use the augmentations, but the Time Warp augmentation depends on the code it exposes.

spec_augment's People

Contributors: prathamsoni, thommackey, zcaceres

spec_augment's Issues

Changes in nb_SparseImageWarp.py to adapt to a newer version of Pytorch

Hi, I found that some code in nb_SparseImageWarp.py needs changes in order to work with Pytorch 1.4.0:

In line 115:
X, LU = torch.gesv(rhs, lhs) ->
X, LU = torch.solve(rhs, lhs)
(torch.gesv has been deprecated)

In line 160:
return 0.5 * torch.square(r) * torch.log(torch.max(r, EPSILON)) ->
return 0.5 * r.pow(2) * torch.log(torch.max(r, EPSILON))

There are also warnings:
nb_SparseImageWarp.py:316: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
alpha = torch.tensor(queries - floor, dtype=grid_type)

But I think it still works properly despite the warnings.

Hope that might help.

Problem about the wrong result

Hi, thanks for your great work. However, when I test the time_warp function with the following input:
[image]
I get a result that seems wrong:
[image]
I use MFSC features and the image size is 400*128; W and the other parameters are left at their defaults. I have no idea why the result is wrong. Could you give me some suggestions? Thank you very much.

Convert speech to text?

Have I misunderstood that SpecAugment could be used to convert speech to text? It seems from your Medium article that it is simply something that enhances the process of one training a model rather than converts speech to text. Am I correct on this? I'm searching for the nearest model available online (for free) with the lowest WER which I can simply give audio and get text back with probabilities. Would appreciate any tips on that!

Time warp fails on short spectrogram

Hi, I have a question: can time warping sometimes distort a spectrogram too much?

[image: Screen Shot 2563-07-22 at 14 37 29]

I applied Time Warping with W=2 and it sometimes results in ugly padding on the right side of the spectrogram. Is there any way to alleviate this problem?

Time warp unexpected behavior and suggestion for sparse_image_warp alternative

Hi, I have noticed an issue with your time warping, which is already mentioned in #12. I think that is not how time warp should behave (I may be wrong: I'm not familiar with TF, so I can't try tfa.image.sparse_image_warp to see the expected result myself).

After searching around and experimenting on my own, I found that PyTorch has the nn.functional.grid_sample function, which can work similarly to tfa.image.dense_image_warp. So the problem can be narrowed down to the lack of a function that does spline interpolation (interpolate_spline) to convert sparse control points into a flow matrix (PyTorch does have nn.functional.interpolate, but its bicubic mode tends to overshoot, so I'm not using it).

My solution is to write a function that interpolates from tensor([0, pt, spec_len]) to a tensor of size spec_len. The code is below (adapted from StackOverflow):

# Reimplemented from: https://stackoverflow.com/questions/61616810/how-to-do-cubic-spline-interpolation-and-integration-in-pytorch
import torch

def h_poly(t):
    tt = t[None, :]**torch.arange(4, device=t.device)[:, None]
    A = torch.tensor([
        [1, 0, -3, 2],
        [0, 1, -2, 1],
        [0, 0, 3, -2],
        [0, 0, -1, 1]
    ], dtype=t.dtype, device=t.device)
    return A @ tt


def interp(x, y, xs):
    m = (y[1:] - y[:-1]) / (x[1:] - x[:-1])
    m = torch.cat([m[[0]], (m[1:] + m[:-1]) / 2, m[[-1]]])
    idxs = torch.searchsorted(x[1:], xs)
    dx = (x[idxs + 1] - x[idxs])
    hh = h_poly((xs - x[idxs]) / dx)
    return hh[0] * y[idxs] + hh[1] * m[idxs] * dx + hh[2] * y[idxs + 1] + hh[3] * m[idxs + 1] * dx
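To make the interpolation step easier to follow, here is a minimal plain-Python sketch of the same piecewise cubic Hermite scheme: secant slopes per segment, averaged at interior knots, combined with the four Hermite basis polynomials that the matrix A encodes. The name `hermite_interp` and the example numbers (a 400-frame spectrogram, warp point 195, warp distance 82) are assumptions for illustration; this is not the code from the issue.

```python
from bisect import bisect_left


def hermite_interp(x, y, xs):
    """Evaluate a piecewise cubic Hermite interpolant through (x[i], y[i]) at xs.

    Hypothetical helper for illustration; x must be strictly increasing and
    every query must lie within [x[0], x[-1]].
    """
    n = len(x)
    # Secant slope of each segment.
    sec = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(n - 1)]
    # Knot slopes: one-sided at the ends, mean of neighbouring secants inside.
    m = [sec[0]] + [(sec[i] + sec[i - 1]) / 2 for i in range(1, n - 1)] + [sec[-1]]
    out = []
    for q in xs:
        # Segment index i such that x[i] <= q <= x[i + 1].
        i = min(bisect_left(x, q, 1) - 1, n - 2)
        dx = x[i + 1] - x[i]
        t = (q - x[i]) / dx
        h00 = 1 - 3 * t**2 + 2 * t**3   # weight of y[i]
        h10 = t - 2 * t**2 + t**3       # weight of m[i] * dx
        h01 = 3 * t**2 - 2 * t**3       # weight of y[i + 1]
        h11 = -t**2 + t**3              # weight of m[i + 1] * dx
        out.append(h00 * y[i] + h10 * m[i] * dx + h01 * y[i + 1] + h11 * m[i + 1] * dx)
    return out


spec_len = 400                       # assumed number of frames
warp_p, warp_d = 195, 82             # illustrative warp point and distance
x = [0.0, float(warp_p), float(spec_len - 1)]
y = [-1.0, (warp_p - warp_d) * 2 / (spec_len - 1) - 1, 1.0]
sample_pos = hermite_interp(x, y, [float(i) for i in range(spec_len)])
```

Each entry of `sample_pos` is the normalized time position, in [-1, 1], that the corresponding output frame would sample from; the curve passes exactly through the three control points.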

After that, I refactored your time_warp function to use grid_sample:

def time_warp(spec, W=50):
    # Input spec is a float tensor of shape (channel, freq_bin, frame)

    num_rows = spec.shape[-2]
    spec_len = spec.shape[-1]
    device, dtype = spec.device, spec.dtype

    # Random warp point on the time axis and random warp distance
    warp_p = torch.randint(W, spec_len - W, (1,)).item()
    warp_d = torch.randint(-W, W, (1,)).item()

    # Make source control points with 3 points on the time axis: 2 anchor points and 1 control point
    src_ctr_pt_time = torch.tensor([0, warp_p, spec_len - 1], dtype=dtype, device=device)
    dst_ctr_pt_time = torch.tensor([0, warp_p - warp_d, spec_len - 1], dtype=dtype, device=device)
    dst_ctr_pt_time = dst_ctr_pt_time * 2 / (spec_len - 1) - 1  # Normalize into the range [-1, 1] to match the grid_sample requirement

    # Interpolate the 3 control points to one sampling position per frame
    src_ctr_pts = torch.linspace(0, spec_len - 1, spec_len, dtype=dtype, device=device)
    dst_ctr_pts = interp(src_ctr_pt_time, dst_ctr_pt_time, src_ctr_pts)

    # Destination grid of shape (1, num_rows, spec_len, 2); the last dimension is (time, freq)
    freq_pts = torch.linspace(-1, 1, num_rows, dtype=dtype, device=device)
    grid = torch.cat((dst_ctr_pts.view(1, 1, -1, 1).expand(1, num_rows, -1, 1),
                      freq_pts.view(-1, 1, 1).expand(1, -1, spec_len, 1)), -1)

    # Warp
    # unsqueeze because grid_sample requires a 4D tensor, while our tensor is only 3D
    warped_spectro = torch.nn.functional.grid_sample(spec.unsqueeze(0), grid, align_corners=True)
    return warped_spectro.squeeze(0)
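The normalization used for the control points may be worth spelling out. With align_corners=True, grid_sample treats -1 and 1 as the centers of the first and last pixels, so a frame index maps linearly onto [-1, 1]. A minimal sketch of that mapping and its inverse, in plain Python; the helper names and the example values are assumptions for illustration:

```python
def to_grid_coord(index, size):
    """Map a pixel index in [0, size - 1] to [-1, 1] (align_corners=True convention)."""
    return 2.0 * index / (size - 1) - 1.0


def from_grid_coord(coord, size):
    """Inverse mapping: a normalized coordinate back to a (fractional) pixel index."""
    return (coord + 1.0) * (size - 1) / 2.0


spec_len = 400                 # assumed number of frames
warp_p, warp_d = 195, 82       # illustrative warp point and distance
# The output frame at warp_p samples the input at frame warp_p - warp_d.
target = to_grid_coord(warp_p - warp_d, spec_len)
```

Here `from_grid_coord(target, spec_len)` recovers frame 113, which is the input frame that ends up at position 195 in the warped output.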

Here is the result with pt=195 and w=82:
[image: Original Spectro]
[image: My implementation]
[image: spec_augment]

As you can see, the warped spectrogram now looks more reasonable when the warp distance is large (82, for audio with roughly 400 frames).

In addition, the run time is much shorter. I ran the code on Colab using the CPU: the original time_warp takes around 1.64 s, while my implementation takes only 12 ms.
[image: Benchmarking]

Lastly, the final code, which can augment a batch of spectrograms, is at the end of this issue.
I haven't tested whether this code uses less memory than sparse_image_warp, but the speed-up is real. I hope this helps with a simpler and faster implementation for our problem.

import torch

def h_poly(t):
    tt = t.unsqueeze(-2)**torch.arange(4, device=t.device).view(-1,1)
    A = torch.tensor([
        [1, 0, -3, 2],
        [0, 1, -2, 1],
        [0, 0, 3, -2],
        [0, 0, -1, 1]
    ], dtype=t.dtype, device=t.device)
    return A @ tt


def hspline_interpolate_1D(x, y, xs):
    '''
    Input x and y must be of shape (batch, n) or (n)
    '''
    m = (y[..., 1:] - y[..., :-1]) / (x[..., 1:] - x[..., :-1])
    m = torch.cat([m[...,[0]], (m[...,1:] + m[...,:-1]) / 2, m[...,[-1]]], -1)
    idxs = torch.searchsorted(x[..., 1:], xs)
    dx = (x.take_along_dim(idxs+1, dim=-1) - x.take_along_dim(idxs, dim=-1))
    hh = h_poly((xs - x.take_along_dim(idxs, dim=-1)) / dx)
    return hh[...,0,:] * y.take_along_dim(idxs, dim=-1) \
        + hh[...,1,:] * m.take_along_dim(idxs, dim=-1) * dx \
        + hh[...,2,:] * y.take_along_dim(idxs+1, dim=-1) \
        + hh[...,3,:] * m.take_along_dim(idxs+1, dim=-1) * dx

def time_warp(specs, W=50):
  '''
  Timewarp augmentation

  param:
    specs: spectrogram of size (batch, channel, freq_bin, length)
    W: strength of warp
  '''
  device = specs.device
  batch_size, _, num_rows, spec_len = specs.shape

  mid_y = num_rows//2
  mid_x = spec_len//2

  warp_p = torch.randint(W, spec_len - W, (batch_size,), device=device)

  # Uniform distribution from (0,W) with chance to be up to W negative
  # warp_d = torch.randn(1)*W # Not used, since the paper's authors draw the distance from a uniform distribution
  warp_d = torch.randint(-W, W, (batch_size,), device=device)
  # Build the control points as floats so that x, y and xs share one dtype
  x = torch.stack([torch.zeros(batch_size, device=device),
                 warp_p.float(), torch.full((batch_size,), spec_len-1., device=device)], 1)
  y = torch.stack([torch.full((batch_size,), -1., device=device),
                 (warp_p-warp_d)*2/(spec_len-1)-1, torch.ones(batch_size, device=device)], 1)

  # Interpolate from 3 points to spec_len
  xs = torch.linspace(0, spec_len-1, spec_len, device=device).unsqueeze(0).expand(batch_size, -1)
  ys = hspline_interpolate_1D(x, y, xs)

  grid = torch.cat(
      (ys.view(batch_size,1,-1,1).expand(-1,num_rows,-1,-1),
       torch.linspace(-1, 1, num_rows, device=device).view(-1,1,1).expand(batch_size,-1,spec_len,-1)), -1)

  return torch.nn.functional.grid_sample(specs, grid, align_corners=True)

Convert mel to audio

Hi, is there anything I can do to convert the mel spectrogram back to audio? I am using librosa to convert it, but it takes ages just to reconstruct 8-10 s of audio. Thanks

Can it warp color images (3 channels)?

It seems that the shape of the input image is (b, h, w), while the same function in TensorFlow can deal with color images. Can the library warp color images (3 channels) by simply adding a dimension in the code?

fatal error: sox.h: No such file or directory

Hi,

I am trying to run your install.sh file in a conda environment and am getting an error:

cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/sox_effects.h:,
                 from /home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/register.cpp:4:
/home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/sox_utils.h:4:10: fatal error: sox.h: No such file or directory
    4 | #include <sox.h>
      |          ^~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
./install.sh: line 7: jupyter: command not found

Can you suggest a solution?

Is the methodology of getting source and dest points logical in Time Warp?

def time_warp(spec, W=5):
    num_rows = spec.shape[1]
    spec_len = spec.shape[2]
    device = spec.device
    y = num_rows//2
    horizontal_line_at_ctr = spec[0][y]
    assert len(horizontal_line_at_ctr) == spec_len

    point_to_warp = horizontal_line_at_ctr[random.randrange(W, spec_len - W)]
    assert isinstance(point_to_warp, torch.Tensor)

    # Uniform distribution from (0,W) with chance to be up to W negative
    dist_to_warp = random.randrange(-W, W)
    src_pts, dest_pts = (torch.tensor([[[y, point_to_warp]]], device=device),
                         torch.tensor([[[y, point_to_warp + dist_to_warp]]], device=device))
    warped_spectro, dense_flows = sparse_image_warp(spec, src_pts, dest_pts)
    return warped_spectro.squeeze(3)

Based on my understanding, the point given should be a time coordinate. Currently, point_to_warp holds an actual value from the spectrogram rather than an index, which does not fit the computation of the source and dest points.

Problem with the time_warp function

Hi, should the following line (which defines the position of the warping point)

point_to_warp = horizontal_line_at_ctr[random.randrange(W, spec_len - W)]  

be changed to

point_to_warp = random.randrange(W, spec_len - W)

Otherwise the following computation doesn't make sense, because it adds a coordinate shift (dist_to_warp) to a spectrogram value rather than to a coordinate.

 src_pts, dest_pts = torch.tensor([[[y, point_to_warp]]]), torch.tensor([[[y, point_to_warp + dist_to_warp]]])
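The change proposed above can be sketched in plain Python. This is only an illustration of choosing the control points as coordinates; `pick_warp_points` is a hypothetical helper, not a function from the repository, and it returns plain tuples instead of tensors.

```python
import random


def pick_warp_points(num_rows, spec_len, W=5, rng=random):
    """Return ((y, x_src), (y, x_dest)) as index coordinates.

    Hypothetical helper: x_src is a time index drawn from [W, spec_len - W),
    and the shift is drawn from [-W, W), as in the quoted time_warp.
    """
    y = num_rows // 2                              # centre frequency row
    point_to_warp = rng.randrange(W, spec_len - W)  # a time index, not a spectrogram value
    dist_to_warp = rng.randrange(-W, W)
    return (y, point_to_warp), (y, point_to_warp + dist_to_warp)


rng = random.Random(0)
src, dest = pick_warp_points(num_rows=128, spec_len=400, W=5, rng=rng)
```

Because the source index is at least W away from either edge and the shift is smaller than W in magnitude, the destination index always stays inside the spectrogram.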

question about SPEC2DB

Is SPEC2DB better than transforms.SpectrogramToDB? Why did you reimplement it?

Thank you for all your assistance

recommended install method does not work

Cloning into 'torchaudio'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 1080 (delta 47), reused 47 (delta 23), pack-reused 993
Receiving objects: 100% (1080/1080), 6.13 MiB | 5.73 MiB/s, done.
Resolving deltas: 100% (442/442), done.
running install
running bdist_egg
running egg_info
creating torchaudio.egg-info
writing torchaudio.egg-info/PKG-INFO
writing dependency_links to torchaudio.egg-info/dependency_links.txt
writing requirements to torchaudio.egg-info/requires.txt
writing top-level names to torchaudio.egg-info/top_level.txt
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
reading manifest file 'torchaudio.egg-info/SOURCES.txt'
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/transforms.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/common_utils.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/functional.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/sox_effects.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/legacy.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/kaldi_io.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio
creating build/lib.linux-x86_64-3.7/test
copying test/common_utils.py -> build/lib.linux-x86_64-3.7/test
copying test/test_jit.py -> build/lib.linux-x86_64-3.7/test
copying test/test_functional.py -> build/lib.linux-x86_64-3.7/test
copying test/test_kaldi_io.py -> build/lib.linux-x86_64-3.7/test
copying test/test_sox_effects.py -> build/lib.linux-x86_64-3.7/test
copying test/__init__.py -> build/lib.linux-x86_64-3.7/test
copying test/test.py -> build/lib.linux-x86_64-3.7/test
copying test/test_transforms.py -> build/lib.linux-x86_64-3.7/test
copying test/test_dataloader.py -> build/lib.linux-x86_64-3.7/test
copying test/test_legacy.py -> build/lib.linux-x86_64-3.7/test
creating build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/yesno.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/vctk.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
creating build/lib.linux-x86_64-3.7/torchaudio/compliance
copying torchaudio/compliance/kaldi.py -> build/lib.linux-x86_64-3.7/torchaudio/compliance
copying torchaudio/compliance/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio/compliance
running build_ext
building '_torch_sox' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/torchaudio
torchaudio/torch_sox.cpp:1:29: fatal error: torch/extension.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
