
torchcrepe's Introduction

torchcrepe


PyTorch implementation of the CREPE [1] pitch tracker. The original TensorFlow implementation can be found here. The provided model weights were obtained by converting the "tiny" and "full" models using MMdnn, an open-source model management framework.

Installation

Perform the system-dependent PyTorch install using the instructions found here.

pip install torchcrepe

Usage

Computing pitch and periodicity from audio

import torchcrepe


# Load audio
audio, sr = torchcrepe.load.audio( ... )

# Here we'll use a 5 millisecond hop length
hop_length = int(sr / 200.)

# Provide a sensible frequency range for your domain (upper limit is 2006 Hz)
# This would be a reasonable range for speech
fmin = 50
fmax = 550

# Select a model capacity--one of "tiny" or "full"
model = 'tiny'

# Choose a device to use for inference
device = 'cuda:0'

# Pick a batch size that doesn't cause memory errors on your gpu
batch_size = 2048

# Compute pitch using first gpu
pitch = torchcrepe.predict(audio,
                           sr,
                           hop_length,
                           fmin,
                           fmax,
                           model,
                           batch_size=batch_size,
                           device=device)

A periodicity metric similar to the CREPE confidence score can also be extracted by passing return_periodicity=True to torchcrepe.predict.
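
For example, continuing the snippet above:

# Also return the periodicity
pitch, periodicity = torchcrepe.predict(audio,
                                        sr,
                                        hop_length,
                                        fmin,
                                        fmax,
                                        model,
                                        batch_size=batch_size,
                                        device=device,
                                        return_periodicity=True)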

Decoding

By default, torchcrepe uses Viterbi decoding on the softmax of the network output. This is different than the original implementation, which uses a weighted average near the argmax of binary cross-entropy probabilities. The argmax operation can cause double/half frequency errors. These can be removed by penalizing large pitch jumps via Viterbi decoding. The decode submodule provides some options for decoding.

# Decode using viterbi decoding (default)
torchcrepe.predict(..., decoder=torchcrepe.decode.viterbi)

# Decode using weighted argmax (as in the original implementation)
torchcrepe.predict(..., decoder=torchcrepe.decode.weighted_argmax)

# Decode using argmax
torchcrepe.predict(..., decoder=torchcrepe.decode.argmax)

Filtering and thresholding

When periodicity is low, the pitch is less reliable. For some problems, it makes sense to mask these less reliable pitch values. However, the periodicity can be noisy and the pitch has quantization artifacts. torchcrepe provides submodules filter and threshold for this purpose. The filter and threshold parameters should be tuned to your data. For clean speech, a 10-20 millisecond window with a threshold of 0.21 has worked.

# We'll use a 15 millisecond window assuming a hop length of 5 milliseconds
win_length = 3

# Median filter noisy confidence value
periodicity = torchcrepe.filter.median(periodicity, win_length)

# Remove inharmonic regions
pitch = torchcrepe.threshold.At(.21)(pitch, periodicity)

# Optionally smooth pitch to remove quantization artifacts
pitch = torchcrepe.filter.mean(pitch, win_length)

For more fine-grained control over pitch thresholding, see torchcrepe.threshold.Hysteresis. This is especially useful for removing spurious voiced regions caused by noise in the periodicity values, but has more parameters and may require more manual tuning to your data.
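
A hedged usage sketch, assuming the default parameters and the same calling convention as torchcrepe.threshold.At:

# Hypothetical: hysteresis thresholding with default parameters
pitch = torchcrepe.threshold.Hysteresis()(pitch, periodicity)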

CREPE was not trained on silent audio. Therefore, it sometimes assigns high confidence to pitch bins in silent regions. You can use torchcrepe.threshold.Silence to manually set the periodicity in silent regions to zero.

periodicity = torchcrepe.threshold.Silence(-60.)(periodicity,
                                                 audio,
                                                 sr,
                                                 hop_length)

Computing the CREPE model output activations

batch = next(torchcrepe.preprocess(audio, sr, hop_length))
probabilities = torchcrepe.infer(batch)

Computing the CREPE embedding space

As in Differentiable Digital Signal Processing [2], this uses the output of the fifth max-pooling layer as a pretrained pitch embedding.

embeddings = torchcrepe.embed(audio, sr, hop_length)

Computing from files

torchcrepe defines the following functions, which are convenient for predicting directly from audio files on disk. Each of these functions also takes a device argument that can be used for device placement (e.g., device='cuda:0').

torchcrepe.predict_from_file(audio_file, ...)
torchcrepe.predict_from_file_to_file(
    audio_file, output_pitch_file, output_periodicity_file, ...)
torchcrepe.predict_from_files_to_files(
    audio_files, output_pitch_files, output_periodicity_files, ...)

torchcrepe.embed_from_file(audio_file, ...)
torchcrepe.embed_from_file_to_file(audio_file, output_file, ...)
torchcrepe.embed_from_files_to_files(audio_files, output_files, ...)
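
For example, a hedged call using one of these functions to predict pitch from a file on GPU 0 (the file name is a placeholder; the keyword arguments are assumed to mirror those of torchcrepe.predict):

pitch = torchcrepe.predict_from_file('speech.wav',
                                     fmin=50,
                                     fmax=550,
                                     model='tiny',
                                     device='cuda:0')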

Command-line interface

usage: python -m torchcrepe
    [-h]
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
    [--hop_length HOP_LENGTH]
    [--output_periodicity_files OUTPUT_PERIODICITY_FILES [OUTPUT_PERIODICITY_FILES ...]]
    [--embed]
    [--fmin FMIN]
    [--fmax FMAX]
    [--model MODEL]
    [--decoder DECODER]
    [--gpu GPU]
    [--no_pad]

optional arguments:
  -h, --help            show this help message and exit
  --audio_files AUDIO_FILES [AUDIO_FILES ...]
                        The audio file to process
  --output_files OUTPUT_FILES [OUTPUT_FILES ...]
                        The file to save pitch or embedding
  --hop_length HOP_LENGTH
                        The hop length of the analysis window
  --output_periodicity_files OUTPUT_PERIODICITY_FILES [OUTPUT_PERIODICITY_FILES ...]
                        The file to save periodicity
  --embed               Performs embedding instead of pitch prediction
  --fmin FMIN           The minimum frequency allowed
  --fmax FMAX           The maximum frequency allowed
  --model MODEL         The model capacity. One of "tiny" or "full"
  --decoder DECODER     The decoder to use. One of "argmax", "viterbi", or
                        "weighted_argmax"
  --gpu GPU             The gpu to perform inference on
  --no_pad              Whether to pad the audio
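
For example, a possible invocation that saves pitch and periodicity for a single file using GPU 0 (file names are placeholders):

python -m torchcrepe \
    --audio_files speech.wav \
    --output_files pitch.pt \
    --output_periodicity_files periodicity.pt \
    --gpu 0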

Tests

The module tests can be run as follows.

pip install pytest
pytest

References

[1] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, “Crepe: A Convolutional Representation for Pitch Estimation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] J. H. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable Digital Signal Processing,” in 2020 International Conference on Learning Representations (ICLR).

torchcrepe's People

Contributors

ben-hayes, fumiama, jonluca, leng-yue, maxrmorrison, mpariente, tandav, thunn, yqzhishen


torchcrepe's Issues

.predict does not work.

Hello! Thank you for your contribution :)

I was going to simply test torchcrepe.predict(), but it doesn't work.
The code is as simple as shown below.
I installed torchcrepe using pip.

audio, sr = torchcrepe.load.audio(path)

frame_rate = num_frames / (len(audio)/sr)
crepe_step_size = 1000 / frame_rate # milliseconds
fmin = 50
fmax = 550

model = 'full'

device = 'cuda:0'

batch_size = 1

pitch, harmonicity = torchcrepe.predict(audio,
                                        sr,
                                        crepe_step_size,
                                        fmin,
                                        fmax,
                                        model,
                                        batch_size=batch_size,
                                        device=device,
                                        decoder=torchcrepe.decode.viterbi,
                                        return_harmonicity=True)

TypeError: predict() got an unexpected keyword argument 'device'

Is this error expected to happen?

Using Weighted Argmax Local Average decoding results in RuntimeError: iter.device(arg).is_cuda() INTERNAL ASSERT FAILED

Using PyPI release v0.0.12, calling .predict with decoder=torchcrepe.decode.weighted_argmax results in the following error:

RuntimeError: iter.device(arg).is_cuda() INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Loops.cuh":94, please report a bug to PyTorch.

This issue on the PyTorch repo (pytorch/pytorch#48393) suggests this is because of a bug preventing tensors from being compared to scalar tensors on a different device. This may be fixed in a later release, but for now it may be best to simply instantiate scalar tensors on the same device as the input.

The problematic lines are:

start = torch.max(torch.tensor(0), bins - 4)
end = torch.min(torch.tensor(logits.size(2)), bins + 5)

If it's okay with you, I'll submit a small PR to address this until the issue is resolved in core PyTorch.
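
A sketch of the proposed interim fix, which creates the scalar tensors on the same device as the input:

start = torch.max(torch.tensor(0, device=bins.device), bins - 4)
end = torch.min(torch.tensor(logits.size(2), device=bins.device), bins + 5)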

Prevent dataloader from allocating memory on unused GPUs

By default, the PyTorch dataloader allocates some memory on all available GPUs. This is usually solved with environment variables, but torchcrepe shouldn't change the user's environment variables. Potential solution: context manager for hot-swapping environment variables.
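
A minimal sketch of such a context manager (names and scope are illustrative; to have any effect, it must run before CUDA is initialized):

import contextlib
import os

@contextlib.contextmanager
def visible_devices(devices):
    # Temporarily restrict the GPUs visible to CUDA, then restore the
    # user's original environment variable
    previous = os.environ.get('CUDA_VISIBLE_DEVICES')
    os.environ['CUDA_VISIBLE_DEVICES'] = devices
    try:
        yield
    finally:
        if previous is None:
            del os.environ['CUDA_VISIBLE_DEVICES']
        else:
            os.environ['CUDA_VISIBLE_DEVICES'] = previous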

Support double input

From version 0.0.19

Code:

import torch
import torchcrepe
torch.set_default_dtype(torch.double)
torchcrepe.filter.mean(torch.DoubleTensor(1,100), 100)

Error:

        # Count the non-masked (valid) elements in each pooling window
>       valid_count = F.conv1d(
            mask.float(),
            ones_kernel,
            stride=1,
            padding=win_length // 2,
        )
E       RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.DoubleTensor) should be the same

Will this error be corrected in the near future?

incorrect prediction for Nsynth dataset

Hi there,

Thanks for the re-implementation! It's really well-formatted.

I have encountered some issues regarding the predicted value. I used one sample from the NSynth dataset as the input file (bass_synthetic_009-025-127.wav). The file is available here: https://drive.google.com/file/d/1_Ltj9Pbezx_5Ve-MLVrkF924vAfJ6j2C/view?usp=sharing

The label of the file shows it has MIDI pitch 25, which, after the proper conversion, is equivalent to around 34 Hz.

However, when I run the algorithm it returns values that seem incorrect (see the attached screenshot).

I ran the original CREPE TensorFlow version and it returns around 34 or 35 Hz.

May I know what causes the error? Or perhaps the data you trained the model on didn't include music?

Best,

Nic

traceback when using median filter

pytorch version: 1.10.0
torchcrepe version: 0.0.19

File "/root/miniconda3/lib/python3.9/site-packages/torchcrepe/filter.py", line 90, in median
x_masked = torch.where(mask.bool(), x, float("inf"))
RuntimeError: expected scalar type float but found double

(Pdb) x.dtype
torch.float32

pd = torchcrepe.filter.median(pd, 3)
pd.dtype is float32

I made a PR to fix it.

Allow arbitrary length generation without GPU memory error

Currently, GPU memory errors prohibit long-form pitch tracking. To fix this, the batch size should be set either by the user or automatically by querying GPU capacity (e.g., with GPUtil). If the file to process exceeds one batch in size, it will have to be detached from the compute graph. Additionally, the option to detach from the compute graph should be provided. If this is given, torch.no_grad() can be safely used to further reduce memory consumption.
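
For illustration, the last idea above can already be approximated from user code by disabling gradient tracking around the call (a sketch reusing the variables from the usage example above; the batch size here is an illustrative value):

import torch

with torch.no_grad():
    pitch = torchcrepe.predict(audio, sr, hop_length, fmin, fmax, model,
                               batch_size=512, device='cuda:0')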

torchcrepe.preprocess returns 'generator' object

First of all, I really appreciate your great repo.

While following your instructions, I got an error with torchcrepe.preprocess().
What I've done is:

import torch
import torchcrepe
audio = torch.rand([1,22050])
torchcrepe.infer(torchcrepe.preprocess(audio, 22050, 256))

And it raised following error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jg623/miniconda3/envs/pwgvc/lib/python3.6/site-packages/torchcrepe/core.py", line 439, in infer
    torchcrepe.load.model(frames.device, model)
AttributeError: 'generator' object has no attribute 'device'

Did I do something wrong with torchcrepe?

In addition, do you have any plans to add batch-wise functions? (i.e., [#batch, #audiosample] to [#batch, #f0s])

Algorithmic idea: Implement a forward-backward decoder?

First of all: Thanks a lot for providing this library, it is super useful! I'm opening this "issue" merely as a means to brainstorm some algorithmic ideas. Feel free to just close the ticket at any time.

I noticed the following behavior: In general it looks like the algorithm is sensitive to the time direction, i.e., the confidence is rather weak directly after the onset of a note, but once a stable pitch has been established, the pitch is tracked robustly into the decay phase of the note. To verify this, I simply applied the algorithm to the time-reversed signal, which just shows the opposite behavior. For example, the following plot shows the transition of 3 consecutive chromatic notes:

[plot: pitch and periodicity curves for the forward and backward passes]

The middle plot shows pitch/frequency, the bottom plot the periodicity/confidence. The blue line corresponds to applying the algorithm in the forward direction; the yellow line corresponds to applying it in the backward time direction. Note that the pitch estimates and confidences are basically "shifted" in either the forward or backward time direction around the "area of uncertainty".

As a very naive approach, I have simply combined the information of the forward and backward passes (the green and red curves). I'm basically just weighting the two results with weights corresponding to |confidence|^p, where p is an exponent that allows transitioning from plain averaging to taking the result with maximum confidence. Even this naive forward+backward combination seems to improve the results quite a bit.
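
A minimal sketch of that combination, assuming pitch and periodicity tensors from a forward pass and from a pass on the time-reversed audio (flipped back so the frames align); the function name and default exponent are illustrative:

import torch

def combine_passes(pitch_fwd, periodicity_fwd, pitch_bwd, periodicity_bwd, p=2.):
    # Weight each pass by |confidence|^p; larger p favors the more
    # confident pass, while p near zero approaches plain averaging
    w_fwd = periodicity_fwd.abs() ** p
    w_bwd = periodicity_bwd.abs() ** p
    total = (w_fwd + w_bwd).clamp(min=1e-8)
    pitch = (w_fwd * pitch_fwd + w_bwd * pitch_bwd) / total
    periodicity = (w_fwd * periodicity_fwd + w_bwd * periodicity_bwd) / total
    return pitch, periodicity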

I'm wondering if it would be worthwhile to incorporate such "forward + backward" logic directly into the decoder. Originally my assumption was that Viterbi decoding would be invariant to the time direction, but that doesn't seem to be the case. Perhaps exploiting the information from both time directions at the decoder level could lead to even better results?

Plotting results? How to do it best?

Hi,

Is there a way to plot the results? I have tried some things already, but it seems like everyone does it differently.
Any hints on where to start?

Predict over batch size greater than 1?

It appears that you can't really predict over audio with batch size greater than 1?

import torch
import torchcrepe

audio = torch.rand((2, 48000), device=device)

# Here we'll use a 5 millisecond hop length
hop_length = 16000 / 200

# Provide a sensible frequency range for your domain (upper limit is 2006 Hz)
# This would be a reasonable range for speech
fmin = 50
fmax = 550

# Select a model capacity--one of "tiny" or "full"
model = 'full'

# Pick a batch size that doesn't cause memory errors on your gpu
batch_size = 2048

# Compute pitch using first gpu
pitch = torchcrepe.predict(audio,
                           16000,
                           hop_length,
                           fmin,
                           fmax,
                           model,
                           batch_size=batch_size,
                           return_periodicity=True
           )

gives:

  File "/opt/miniconda3/lib/python3.9/site-packages/torchcrepe/core.py", line 117, in predict
    for frames in generator:
  File "/opt/miniconda3/lib/python3.9/site-packages/torchcrepe/core.py", line 682, in preprocess
    frames = torch.nn.functional.unfold(
  File "/opt/miniconda3/lib/python3.9/site-packages/torch/nn/functional.py", line 4756, in unfold
    return torch._C._nn.im2col(input, _pair(kernel_size), _pair(dilation), _pair(padding), _pair(stride))
TypeError: im2col(): argument 'stride' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got float"

(This is even though I have 16000 Hz audio, which skips the resample line that assumes 1-batch audio.)

RuntimeError: GlooDeviceFactory::makeUVDevice(): interface or hostname can't be empty on mango rvc

Traceback (most recent call last):
File "multiprocessing\process.py", line 315, in _bootstrap
File "multiprocessing\process.py", line 108, in run
File "C:\Users\me\Desktop\rvc\train_nsf_sim_cache_sid_load_pretrain.py", line 100, in run
dist.init_process_group(
File "C:\Users\me\Desktop\rvc\runtime\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
default_pg = _new_process_group_helper(
File "C:\Users\me\Desktop\rvc\runtime\lib\site-packages\torch\distributed\distributed_c10d.py", line 994, in _new_process_group_helper
backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: GlooDeviceFactory::makeUVDevice(): interface or hostname can't be empty

And I did run pip install torchcrepe beforehand.

Actually load and return a model?

I am trying to implement a torchcrepe baseline for the HEAR 2021 NeurIPS competition. You can see a wav2vec2 example here: https://github.com/neuralaudio/hear-baseline/blob/main/hearbaseline/wav2vec2.py

However, torchcrepe.load.model is pretty weird. It just kind of loads the model into some explicit global state?

Is there a way to actually just load the PyTorch model and pass it around? The current implementation is pretty opinionated and hard to build on.

Thoughts and suggestions about improving precision and smoothening the results

Here are three topics related to the postprocessing methods.

1. Why use sigmoid here?

# Convert to probabilities
with torch.no_grad():
    probs = torch.sigmoid(logits)

# Apply weights
cents = (weighted_argmax.weights * probs).sum(dim=1) / probs.sum(dim=1)

As mentioned in the original CREPE paper, the frequency bins are "Gaussian-blurred" around the ground-truth f0 in the training labels. Since the unbiased estimate of the expectation of a normal distribution is the sample average, the conversion should use ReLU instead of sigmoid, i.e., compute a direct average over the local bins with positive values.
Also, the original TensorFlow repository uses a local average instead of sigmoid. In my own experiments, directly averaging the logits produces much smoother results than the current version, even without dithering or filtering.
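
For illustration, a hypothetical sketch of that change (variable names are illustrative; logits has shape (batch, bins, time) and cent_bins holds the cent value of each bin):

import torch

def local_average_cents(logits, cent_bins):
    # Weight candidate cent values by ReLU'd logits rather than
    # sigmoid probabilities
    salience = torch.relu(logits)
    weights = cent_bins.view(1, -1, 1)
    return (weights * salience).sum(dim=1) / salience.sum(dim=1).clamp(min=1e-8)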

2. Combine viterbi and weighted argmax.

The current Viterbi decoder searches frequency bins along the best path, but only at a precision of 20 cents, since the Viterbi algorithm operates on discrete states:

# Convert to pytorch
bins = torch.tensor(bins, device=probs.device)
# Convert to frequency in Hz
return bins, torchcrepe.convert.bins_to_frequency(bins)

However, we can then apply what the weighted argmax decoder does, as something called weighted_viterbi, for example. In short, this means replacing the initial argmax operation in the weighted argmax decoder with Viterbi decoding. This way we get a smoother result without quantization errors, while not depending on dithering. The original TensorFlow repository also implements this as the default behavior of its Viterbi decoder.

3. Consider disabling dithering or making it optional.

As discussed above, dithering seems to do more harm than good, especially for the weighted decoders. In my own experiments, the weighted_viterbi decoder produces quite smooth results without dithering and filtering, which are also more accurate without the random errors introduced by dithering.

There are two ways to solve this problem in my opinion:

  1. Disable dithering for the weighted decoders and enable it for the others.
  2. Add an option to let the user choose whether to apply dithering.

What are your thoughts about these topics? I'm submitting this issue because I think it would be better to have some discussions before I could make a pull request on it.

Rewrite median and mean pooling for a 5x speedup

After profiling, I found that the current implementation of mean and median pooling is extremely slow (Python loops), and that it consumes most of the time during CREPE inference. I therefore rewrote both functions in torch operations, accelerating them by roughly 1000x. Overall, this improves the inference speed of crepe full to about 5x the original.
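
For illustration, a minimal vectorized mean filter (a sketch only; the linked commit below handles masking and edge cases more carefully):

import torch
import torch.nn.functional as F

def mean_filter(x, win_length):
    # Sliding mean over the last dimension via a 1-D convolution
    # (assumes an odd win_length and an input of shape (batch, time))
    kernel = torch.ones(1, 1, win_length, device=x.device, dtype=x.dtype)
    kernel = kernel / win_length
    return F.conv1d(x.unsqueeze(1), kernel, padding=win_length // 2).squeeze(1)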

The rewritten pooling and test cases can be found here: fishaudio/fish-diffusion@f03014f

I would like to make a PR to torchcrepe if you want to adopt this optimization.

threshold.Silence fails for batched signals

threshold.Silence crashes if the batch dimension of the input tensor is greater than 1. This seems to be because loudness is calculated by converting tensors to numpy and using librosa; the line that causes the actual crash is line 37 of loudness.py, which tries to squeeze the 0th dimension.

Librosa doesn't support batches, but since calculating the loudness only involves an STFT, logs, and means, it doesn't seem hard to change it to use the torch versions of these functions.

Since the rest of the package (at least seemingly) works fine with batched inputs, this might be a worthwhile change.

Remove torchaudio dependency

Torchaudio was originally used for resampling, but it was replaced with resampy for more precise numerical agreement with the original CREPE. Now torchaudio is only used in the test cases for loading audio.

Same input, same params, different output

Issue: The model would produce different results for the same input.
To reproduce:

import torch, torchcrepe

audio = torch.rand([1, 10000])
pitch_0 = torchcrepe.predict(audio=audio, sample_rate=16000, model='tiny')
pitch_1 = torchcrepe.predict(audio=audio, sample_rate=16000, model='tiny')
torch.equal(pitch_0, pitch_1)

Expected behaviour: The above code should return True.
Actual behaviour: It returns False, indicating that pitch_0 and pitch_1 have different values.

Is there a way to make sure that the model returns consistent results across multiple runs?
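
One hypothetical workaround, assuming the nondeterminism comes from random operations such as the dithering discussed in the decoding issue above: seed the candidate RNGs before each call (which seed matters depends on whether the randomness is drawn with torch or numpy).

import numpy as np
import torch
import torchcrepe

def seeded_predict(audio, seed=0, **kwargs):
    # Reset the global RNGs so any random dithering repeats exactly
    torch.manual_seed(seed)
    np.random.seed(seed)
    return torchcrepe.predict(audio, **kwargs)

audio = torch.rand([1, 10000])
pitch_0 = seeded_predict(audio, sample_rate=16000, model='tiny')
pitch_1 = seeded_predict(audio, sample_rate=16000, model='tiny')
print(torch.equal(pitch_0, pitch_1))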
