zcaceres / spec_augment

🔦 A Pytorch implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Home Page: https://arxiv.org/abs/1904.08779

License: MIT License

Jupyter Notebook 99.26% Shell 0.01% Python 0.74%

spec_augment's Introduction

SpecAugment with Pytorch

A Pytorch Implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Medium Article

SpecAugment is a state-of-the-art data augmentation approach for speech recognition.

The paper's authors did not publish code that I could find, and their implementation was in TensorFlow. We implemented all three SpecAugment transforms using Pytorch, torchaudio, and fastai / fastai-audio.

To use:

Run install.sh (I recommend using a unique conda env for the project)

After the install script runs, you should have a torchaudio folder in your project folder.

Check out SpecAugment.ipynb (a Jupyter notebook) for the functions.

Augmentations

Time Warp

Time Mask

Frequency Mask

Combined:
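As a rough sketch of the two masking transforms listed above, and of how they can be combined, the following zeroes out a random band of frequency bins and a random span of frames. The function names, the default widths `F` and `T`, and the `mask_value` parameter are assumptions made for illustration; they are not taken from SpecAugment.ipynb.

```python
import random
import torch

def freq_mask(spec, F=27, num_masks=1, mask_value=0.0):
    """Mask up to F consecutive frequency bins; spec is (channel, freq_bin, frame)."""
    out = spec.clone()
    num_bins = out.shape[1]
    for _ in range(num_masks):
        f = random.randint(0, min(F, num_bins))   # mask width, may be 0
        f0 = random.randint(0, num_bins - f)      # first masked bin
        out[:, f0:f0 + f, :] = mask_value
    return out

def time_mask(spec, T=40, num_masks=1, mask_value=0.0):
    """Mask up to T consecutive frames; spec is (channel, freq_bin, frame)."""
    out = spec.clone()
    num_frames = out.shape[2]
    for _ in range(num_masks):
        t = random.randint(0, min(T, num_frames))  # mask width, may be 0
        t0 = random.randint(0, num_frames - t)     # first masked frame
        out[:, :, t0:t0 + t] = mask_value
    return out

# Hypothetical input: one channel, 128 frequency bins, 400 frames
spec = torch.rand(1, 128, 400)
combined = time_mask(freq_mask(spec))
```

Both functions copy the input rather than masking in place, so the original spectrogram stays available for comparison.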

Note on Time Warp

The Time Warp augmentation relies on TensorFlow-specific functionality not supported in Pytorch. We implemented supporting functions for this augmentation in SparseImageWarp.ipynb. You do not need to read that notebook to use the augmentations, but Time Warp depends on the code it exposes.

spec_augment's Issues

Changes in nb_SparseImageWarp.py to adapt to newer versions of PyTorch

Hi, I found that some code in nb_SparseImageWarp.py needs changes in order to work with PyTorch 1.4.0:

In line 115:
X, LU = torch.gesv(rhs, lhs) ->
X, LU = torch.solve(rhs, lhs)
(torch.gesv has been deprecated)

In line 160:
return 0.5 * torch.square(r) * torch.log(torch.max(r, EPSILON)) ->
return 0.5 * r.pow(2) * torch.log(torch.max(r, EPSILON))

There are also warnings:
nb_SparseImageWarp.py:316: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
alpha = torch.tensor(queries - floor, dtype=grid_type)

But I think it will still work properly.

Hope that might help.
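For anyone on a later PyTorch release, here is a hedged sketch of a version-tolerant wrapper for the solve call, plus the copy pattern that the UserWarning recommends. The helper name `solve_compat` and the sample tensors are made up for illustration; the only library calls used are `torch.linalg.solve`, `torch.solve`, and `clone().detach()`. Note that `torch.linalg.solve` takes the coefficient matrix first and returns only the solution.

```python
import torch

def solve_compat(rhs, lhs):
    """Solve lhs @ X = rhs across PyTorch versions and return only X."""
    if hasattr(torch, "linalg") and hasattr(torch.linalg, "solve"):
        # Newer releases: coefficient matrix first, no LU factor returned
        return torch.linalg.solve(lhs, rhs)
    # Older releases such as 1.4.0: right-hand side first, returns (X, LU)
    X, _ = torch.solve(rhs, lhs)
    return X

# Tiny illustrative system: 2x = 2 and 4y = 8
lhs = torch.tensor([[2.0, 0.0], [0.0, 4.0]])
rhs = torch.tensor([[2.0], [8.0]])
X = solve_compat(rhs, lhs)

# Copying an existing tensor without triggering the UserWarning
queries = torch.tensor([1.25, 2.75])
alpha = (queries - queries.floor()).clone().detach().to(torch.float32)
```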

Problem about the wrong result

Hi, thanks for your great work. However, when I test the time_warp function with the following input:

I get a result that seems wrong:

I use MFSC features and the image size is 400*128 (dim); W and the other parameters are left at their defaults. I have no idea why the result is wrong. Could you give me some suggestions? Thank you very much.

Convert speech to text?

Have I misunderstood that SpecAugment could be used to convert speech to text? It seems from your Medium article that it simply improves the training of a model rather than converting speech to text. Am I correct on this? I'm searching for the nearest model available online (for free) with the lowest WER to which I can simply give audio and get text back with probabilities. I would appreciate any tips on that!

Time warp fails on short spectrogram

Hi, I would like to ask whether time warping can sometimes distort a spectrogram too much.

I applied Time Warping with W=2 and it sometimes results in ugly padding on the right side of the spectrogram. Is there any way to alleviate this problem?
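One guard that may help with short spectrograms is to shrink W whenever the range `[W, spec_len - W)` would otherwise be empty, and to skip warping when no valid point exists. The sketch below only samples the warp parameters; the helper name `sample_warp_params` is hypothetical, and it is not claimed to remove the padding artifact described above.

```python
import random

def sample_warp_params(spec_len, W=50):
    """Return (point_to_warp, dist_to_warp), or None if the input is too short."""
    # Shrink W so that the range [W, spec_len - W) is never empty
    W = min(W, (spec_len - 1) // 2)
    if W < 1:
        return None  # fewer than 3 frames: nothing sensible to warp
    point_to_warp = random.randrange(W, spec_len - W)  # frame index
    dist_to_warp = random.randrange(-W, W)             # shift in frames
    return point_to_warp, dist_to_warp
```

With this clamping, the shifted point `point_to_warp + dist_to_warp` always stays inside the spectrogram.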

How to include your code in our library

In my initial experiments I got some promising results with our ASR library. Can I include a modified py version in our library? How can I reference your work? I didn't see any license anywhere.

Time warp unexpected behavior and suggestion for sparse_image_warp alternative

Hi, I have noticed an issue with your time warping, which is already mentioned in #12. I think that is not how time warp should behave (I may be wrong, since I'm not familiar with TF and can't try tfa.image.sparse_image_warp to see the expected result myself).

After searching around and experimenting on my own, I found that PyTorch has the nn.functional.grid_sample function, which works similarly to tfa.image.dense_image_warp. So the problem can be narrowed down to the lack of a function that does spline interpolation (interpolate_spline) to convert sparse control points into a flow matrix (PyTorch does have nn.functional.interpolate, but its bicubic mode tends to overshoot, so I'm not using it).
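To make the grid_sample convention concrete, here is a small sketch with a made-up 4 x 6 input: an identity grid, whose last dimension holds (x, y) coordinates normalized to [-1, 1], reproduces the input. A time warp then amounts to replacing the x (time) coordinates with warped ones. The tensor sizes and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

spec = torch.rand(1, 1, 4, 6)  # (batch, channel, freq_bin, frame)
_, _, num_rows, spec_len = spec.shape

# x runs along the time axis, y along the frequency axis, both in [-1, 1]
xs = torch.linspace(-1, 1, spec_len).view(1, 1, spec_len, 1).expand(1, num_rows, spec_len, 1)
ys = torch.linspace(-1, 1, num_rows).view(1, num_rows, 1, 1).expand(1, num_rows, spec_len, 1)
grid = torch.cat((xs, ys), dim=-1)  # (batch, freq_bin, frame, 2)

# With align_corners=True, -1 and 1 map to the centers of the corner pixels,
# so this grid samples every pixel at its own location
same = F.grid_sample(spec, grid, align_corners=True)
```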

My solution is to write a function that interpolates from tensor([0, pt, spec_len]) to a tensor of size spec_len. The code is below (adapted from StackOverflow):

# Reimplemented from: https://stackoverflow.com/questions/61616810/how-to-do-cubic-spline-interpolation-and-integration-in-pytorch
import torch

def h_poly(t):
    tt = t[None, :]**torch.arange(4, device=t.device)[:, None]
    A = torch.tensor([
        [1, 0, -3, 2],
        [0, 1, -2, 1],
        [0, 0, 3, -2],
        [0, 0, -1, 1]
    ], dtype=t.dtype, device=t.device)
    return A @ tt


def interp(x, y, xs):
    m = (y[1:] - y[:-1]) / (x[1:] - x[:-1])
    m = torch.cat([m[[0]], (m[1:] + m[:-1]) / 2, m[[-1]]])
    idxs = torch.searchsorted(x[1:], xs)
    dx = (x[idxs + 1] - x[idxs])
    hh = h_poly((xs - x[idxs]) / dx)
    return hh[0] * y[idxs] + hh[1] * m[idxs] * dx + hh[2] * y[idxs + 1] + hh[3] * m[idxs + 1] * dx

After that, I refactored your time_warp function to use grid_sample:

def time_warp(spec, W=50):
    # Input spec has shape (channel, freq_bin, frame)

    num_rows = spec.shape[-2]
    spec_len = spec.shape[-1]
    device = spec.device

    # Random warp point and warp distance, both in frames
    pt = torch.randint(W, spec_len - W, (1,), device=device).item()
    w = torch.randint(-W, W, (1,), device=device).item()

    # Make source control points with 3 points on the time axis: 2 anchor points and 1 control point
    # (float tensors, since interp divides and calls searchsorted on them)
    src_ctr_pt_time = torch.tensor([0., pt, spec_len-1], device=device)
    dst_ctr_pt_time = torch.tensor([0., pt-w, spec_len-1], device=device)
    dst_ctr_pt_time = dst_ctr_pt_time*2/(spec_len-1) - 1 # Normalize into the range [-1, 1] to match the grid_sample requirement

    # Interpolate from the 3 control points to one time coordinate per frame
    src_ctr_pts = torch.linspace(0, spec_len-1, spec_len, device=device)
    dst_ctr_pts = interp(src_ctr_pt_time, dst_ctr_pt_time, src_ctr_pts)

    # Destination grid: warped time coordinates (x) and unchanged frequency coordinates (y)
    grid = torch.cat((dst_ctr_pts.view(1,1,-1,1).expand(1,num_rows,-1,1),
     torch.linspace(-1, 1, num_rows, device=device).view(-1,1,1).expand(1,-1,spec_len,1)), -1)

    # warp
    # unsqueeze since grid_sample requires a 4D tensor, while our tensor is only 3D
    warped_spectro = torch.nn.functional.grid_sample(spec.unsqueeze(0), grid, align_corners=True)
    return warped_spectro.squeeze(0)

Here is the result with pt=195 and w=82:

As you can see, the warped spectrogram looks more reasonable now when the warp distance is large (82 in comparison to audio with roughly 400 frames).

In addition, the run time is much shorter. I ran the code on Colab using the CPU: the original time_warp takes around 1.64 s, while my implementation takes only 12 ms.

Lastly, the final code, which can augment a batch of spectrograms, is at the end of this issue.
I haven't tested whether this code uses less memory than sparse_image_warp, but the speed-up is substantial. I hope this helps towards a simpler and faster implementation.

def h_poly(t):
    tt = t.unsqueeze(-2)**torch.arange(4, device=t.device).view(-1,1)
    A = torch.tensor([
        [1, 0, -3, 2],
        [0, 1, -2, 1],
        [0, 0, 3, -2],
        [0, 0, -1, 1]
    ], dtype=t.dtype, device=t.device)
    return A @ tt


def hspline_interpolate_1D(x, y, xs):
    '''
    Input x and y must be of shape (batch, n) or (n)
    '''
    m = (y[..., 1:] - y[..., :-1]) / (x[..., 1:] - x[..., :-1])
    m = torch.cat([m[...,[0]], (m[...,1:] + m[...,:-1]) / 2, m[...,[-1]]], -1)
    idxs = torch.searchsorted(x[..., 1:], xs)
    dx = (x.take_along_dim(idxs+1, dim=-1) - x.take_along_dim(idxs, dim=-1))
    hh = h_poly((xs - x.take_along_dim(idxs, dim=-1)) / dx)
    return hh[...,0,:] * y.take_along_dim(idxs, dim=-1) \
        + hh[...,1,:] * m.take_along_dim(idxs, dim=-1) * dx \
        + hh[...,2,:] * y.take_along_dim(idxs+1, dim=-1) \
        + hh[...,3,:] * m.take_along_dim(idxs+1, dim=-1) * dx

def time_warp(specs, W=50):
  '''
  Timewarp augmentation

  param:
    specs: spectrogram of size (batch, channel, freq_bin, length)
    W: strength of warp
  '''
  device = specs.device
  batch_size, _, num_rows, spec_len = specs.shape

  mid_y = num_rows//2
  mid_x = spec_len//2

  warp_p = torch.randint(W, spec_len - W, (batch_size,), device=device)

  # Uniform distribution from (0,W) with chance to be up to W negative
  # warp_d = torch.randn(1)*W # Not using this since the paper author make random number with uniform distribution
  warp_d = torch.randint(-W, W, (batch_size,), device=device)
  # Build the control points as floats so that searchsorted and the spline see a single dtype
  x = torch.stack([torch.zeros(batch_size, device=device),
                 warp_p.float(), torch.full((batch_size,), spec_len-1., device=device)], 1)
  y = torch.stack([-torch.ones(batch_size, device=device),
                 (warp_p-warp_d)*2/(spec_len-1)-1, torch.ones(batch_size, device=device)], 1)

  # Interpolate from 3 points to spec_len
  xs = torch.linspace(0, spec_len-1, spec_len, device=device).unsqueeze(0).expand(batch_size, -1)
  ys = hspline_interpolate_1D(x, y, xs)

  grid = torch.cat(
      (ys.view(batch_size,1,-1,1).expand(-1,num_rows,-1,-1),
       torch.linspace(-1, 1, num_rows, device=device).view(-1,1,1).expand(batch_size,-1,spec_len,-1)), -1)

  return torch.nn.functional.grid_sample(specs, grid, align_corners=True)

Convert mel to audio

Hi, is there anything I can do to convert the mel spectrogram back to audio? I am using librosa to convert it, but it takes ages just to reconstruct 8-10 s of audio. Thanks

Is it able to warp color images (3 channels)?

It seems that the shape of the input image is (b, h, w), while the same function in TensorFlow can deal with color images. Is the library able to warp color images (3 channels) by simply adding a dimension in the code?

It seems that there are two masking colors, Right?

Hi, @qcai2002 @zcaceres

It seems that there are two masking colors, green and yellow, in the pseudo-color images, right?
On a 0 to 255 pixel scale, is green level 128 and yellow level 255?

Thanks.

fatal error: sox.h: No such file or directory

Hi,

I am trying to run your install.sh file in a conda environment and am getting an error:

cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/sox_effects.h:,
                 from /home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/register.cpp:4:
/home/mds-student/Documents/SpecAUGMENT/spec_augment-2647a201138079f6486977a25a937133e3c5e825/torchaudio/torchaudio/csrc/sox_utils.h:4:10: fatal error: sox.h: No such file or directory
    4 | #include <sox.h>
      |          ^~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
./install.sh: line 7: jupyter: command not found

Can you suggest a solution?

Is the methodology of getting source and dest points logical in Time Warp?

def time_warp(spec, W=5):
    num_rows = spec.shape[1]
    spec_len = spec.shape[2]
    device = spec.device
    y = num_rows//2
    horizontal_line_at_ctr = spec[0][y]
    assert len(horizontal_line_at_ctr) == spec_len

    point_to_warp = horizontal_line_at_ctr[random.randrange(W, spec_len - W)]
    assert isinstance(point_to_warp, torch.Tensor)

    # Uniform distribution from (0,W) with chance to be up to W negative
    dist_to_warp = random.randrange(-W, W)
    src_pts, dest_pts = (torch.tensor([[[y, point_to_warp]]], device=device),
                         torch.tensor([[[y, point_to_warp + dist_to_warp]]], device=device))
    warped_spectro, dense_flows = sparse_image_warp(spec, src_pts, dest_pts)
    return warped_spectro.squeeze(3)

Based on my understanding, the point given should be a time coordinate. Currently, point_to_warp holds an actual value from the spectrogram, which does not fit the computation of the source and destination points.

In Time Warp, how to choose parameter W?

Hi, thank you for this repository. How should the parameter W be chosen? Is it just random? If it is selected randomly, what is the range?

Problem with the time_warp function

Hi, should the following line (which defines the position of the warping point)

point_to_warp = horizontal_line_at_ctr[random.randrange(W, spec_len - W)]

be changed to

point_to_warp = random.randrange(W, spec_len - W)

Otherwise the following computation doesn't make sense, because it adds a coordinate shift (dist_to_warp) to a spectrogram value.

 src_pts, dest_pts = torch.tensor([[[y, point_to_warp]]]), torch.tensor([[[y, point_to_warp + dist_to_warp]]])
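A minimal sketch of the change proposed in this issue, isolated from sparse_image_warp so it can be run on its own: the warp point is drawn as a frame index, so adding dist_to_warp shifts a coordinate rather than a spectrogram value. The helper name `pick_warp_points` is hypothetical and is not part of the repository.

```python
import random
import torch

def pick_warp_points(num_rows, spec_len, W=5):
    """Return (src_pts, dest_pts) as (1, 1, 2) tensors of [row, frame] coordinates."""
    y = num_rows // 2
    # A frame index on the time axis, not a value read from the spectrogram
    point_to_warp = random.randrange(W, spec_len - W)
    dist_to_warp = random.randrange(-W, W)
    src_pts = torch.tensor([[[y, point_to_warp]]])
    dest_pts = torch.tensor([[[y, point_to_warp + dist_to_warp]]])
    return src_pts, dest_pts

# Hypothetical spectrogram size: 128 frequency bins, 400 frames
src_pts, dest_pts = pick_warp_points(num_rows=128, spec_len=400)
```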

question about SPEC2DB

Is SPEC2DB better than transforms.SpectrogramToDB? Why did you reimplement it?

Thank you for all your assistance

recommended install method does not work

Cloning into 'torchaudio'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 1080 (delta 47), reused 47 (delta 23), pack-reused 993
Receiving objects: 100% (1080/1080), 6.13 MiB | 5.73 MiB/s, done.
Resolving deltas: 100% (442/442), done.
running install
running bdist_egg
running egg_info
creating torchaudio.egg-info
writing torchaudio.egg-info/PKG-INFO
writing dependency_links to torchaudio.egg-info/dependency_links.txt
writing requirements to torchaudio.egg-info/requires.txt
writing top-level names to torchaudio.egg-info/top_level.txt
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
reading manifest file 'torchaudio.egg-info/SOURCES.txt'
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/transforms.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/common_utils.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/functional.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/sox_effects.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/legacy.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/kaldi_io.py -> build/lib.linux-x86_64-3.7/torchaudio
copying torchaudio/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio
creating build/lib.linux-x86_64-3.7/test
copying test/common_utils.py -> build/lib.linux-x86_64-3.7/test
copying test/test_jit.py -> build/lib.linux-x86_64-3.7/test
copying test/test_functional.py -> build/lib.linux-x86_64-3.7/test
copying test/test_kaldi_io.py -> build/lib.linux-x86_64-3.7/test
copying test/test_sox_effects.py -> build/lib.linux-x86_64-3.7/test
copying test/__init__.py -> build/lib.linux-x86_64-3.7/test
copying test/test.py -> build/lib.linux-x86_64-3.7/test
copying test/test_transforms.py -> build/lib.linux-x86_64-3.7/test
copying test/test_dataloader.py -> build/lib.linux-x86_64-3.7/test
copying test/test_legacy.py -> build/lib.linux-x86_64-3.7/test
creating build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/yesno.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/vctk.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
copying torchaudio/datasets/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio/datasets
creating build/lib.linux-x86_64-3.7/torchaudio/compliance
copying torchaudio/compliance/kaldi.py -> build/lib.linux-x86_64-3.7/torchaudio/compliance
copying torchaudio/compliance/__init__.py -> build/lib.linux-x86_64-3.7/torchaudio/compliance
running build_ext
building '_torch_sox' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/torchaudio
torchaudio/torch_sox.cpp:1:29: fatal error: torch/extension.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1