FCN-f0

Code for running monophonic pitch (F0) estimation using the fully-convolutional neural network models described in the following publication:

L. Ardaillon and A. Roebel, "Fully-Convolutional Network for Pitch Estimation of Speech Signals", Proc. Interspeech, 2019.

We kindly request that academic publications making use of our FCN models cite this paper, which can be downloaded from the following URL: https://hal.archives-ouvertes.fr/hal-02439798/document

Description

The code provided in this repository performs monophonic pitch (F0) estimation using fully-convolutional neural networks. It is partly based on code from the CREPE repository: https://github.com/marl/crepe

The provided code runs pitch estimation on given sound files using the provided pretrained models; no code is currently provided to train the model on new data. Three different fully-convolutional pre-trained models are provided. These models have been trained exclusively on (synthetic) speech data and may thus not perform as well on other types of sounds, such as musical instruments. Note that the output F0 values are also limited to the target range [30, 1000] Hz, which is suitable for vocal signals (including high-pitched soprano singing).

The models, algorithm, training, and evaluation procedures have been described in a publication entitled "Fully-Convolutional Network for Pitch Estimation of Speech Signals", presented at the Interspeech 2019 conference (https://www.isca-speech.org/archive/Interspeech_2019/pdfs/2815.pdf).

Below are the results of our evaluations comparing our models to the SWIPE algorithm and the CREPE model, in terms of Raw Pitch Accuracy (mean and standard deviation, on both a test database of synthetic speech, "PAN-synth", and a database of real speech samples with manually-corrected ground truth, "manual"). For this evaluation, CREPE has been evaluated both with the pretrained model provided in the CREPE repository ("CREPE" in the table) and with a model retrained from scratch on our synthetic database ("CREPE-speech"). FCN models have been evaluated on 8 kHz audio, while CREPE and SWIPE have been trained and evaluated on 16 kHz audio.

|  | FCN-1953 | FCN-993 | FCN-929 | CREPE | CREPE-speech | SWIPE |
| --- | --- | --- | --- | --- | --- | --- |
| PAN-synth (25 cents) | 93.62 ± 3.34% | 94.31 ± 3.15% | 93.50 ± 3.43% | 77.62 ± 9.31% | 86.92 ± 8.28% | 84.56 ± 11.68% |
| PAN-synth (50 cents) | 98.37 ± 1.62% | 98.53 ± 1.54% | 98.27 ± 1.73% | 91.23 ± 6.00% | 97.27 ± 2.09% | 93.10 ± 7.26% |
| PAN-synth (200 cents) | 99.81 ± 0.64% | 99.79 ± 0.65% | 99.77 ± 0.73% | 95.65 ± 5.17% | 99.25 ± 1.07% | 97.51 ± 4.90% |
| manual (50 cents) | 88.32 ± 6.33% | 88.57 ± 5.77% | 88.88 ± 5.73% | 87.03 ± 7.35% | 88.45 ± 5.70% | 85.93 ± 7.62% |
| manual (200 cents) | 97.35 ± 3.02% | 97.31 ± 2.56% | 97.36 ± 2.51% | 92.57 ± 5.22% | 96.63 ± 2.91% | 95.03 ± 4.04% |
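Raw Pitch Accuracy counts the fraction of voiced reference frames whose estimate lies within a given threshold of the ground truth, where the deviation is measured in cents (1200 · log2 of the frequency ratio). A minimal sketch of the metric, with illustrative frame values:

```python
import math

def raw_pitch_accuracy(f_ref, f_est, threshold_cents=50.0):
    """Fraction of voiced reference frames (f_ref > 0) whose estimate
    deviates by at most `threshold_cents` cents from the reference."""
    voiced = [(r, e) for r, e in zip(f_ref, f_est) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(
        1 for r, e in voiced
        if e > 0 and abs(1200.0 * math.log2(e / r)) <= threshold_cents
    )
    return hits / len(voiced)

# Illustrative frames: 190 Hz vs 200 Hz is ~89 cents off, so it misses
# the 50-cent threshold; the unvoiced frame (ref = 0) is ignored.
ref = [100.0, 200.0, 0.0, 440.0]
est = [101.0, 190.0, 50.0, 441.0]
print(raw_pitch_accuracy(ref, est, threshold_cents=50.0))
```

The 25, 50, and 200 cent columns in the table above correspond to different values of this threshold.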

Our synthetic speech database has been created by resynthesizing the BREF [2] and TIMIT [3] databases using the PAN synthesis engine, described in [4, Section 3.5.2].

We also compared the different models and algorithms in terms of potential latency (with a real-time implementation in mind), where the latency corresponds to the duration of half the (minimal) input size, and in terms of computation times on both a GPU and a single-core CPU:

|  | FCN-1953 | FCN-993 | FCN-929 | CREPE | SWIPE |
| --- | --- | --- | --- | --- | --- |
| Latency (s) | 0.122 | 0.062 | 0.058 | 0.032 | 0.128 |
| Computation time on GPU (s) | 0.016 | 0.010 | 0.021 | 0.092 | X |
| Computation time on CPU (s) | 1.65 | 0.89 | 3.34 | 14.79 | 0.63 |
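The latency figures follow directly from the window sizes: each FCN model's name gives its minimal input size in samples on 8 kHz audio, while CREPE uses 1024-sample windows on 16 kHz audio. A quick sanity check of the table's latency row:

```python
def half_window_latency(input_size_samples, sample_rate_hz):
    """Latency taken as the duration of half the minimal input window."""
    return input_size_samples / 2.0 / sample_rate_hz

# FCN model names encode the input size in samples (8 kHz audio);
# CREPE's 1024-sample window at 16 kHz is taken from the CREPE paper.
for name, size, sr in [("FCN-1953", 1953, 8000),
                       ("FCN-993", 993, 8000),
                       ("FCN-929", 929, 8000),
                       ("CREPE", 1024, 16000)]:
    print(name, round(half_window_latency(size, sr), 3))
```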

Example command-line usage (using provided pretrained models)

Default analysis: this runs the FCN-993 model and outputs the result as a CSV file in the same folder as the input file (replacing the file extension with ".csv"):

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav
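The resulting CSV can then be loaded for further processing. A minimal sketch, assuming the output follows CREPE's column convention (time, frequency, confidence) — check the actual header of your file:

```python
import csv
import io

def load_f0_csv(fileobj):
    """Parse an F0 CSV into (times, frequencies) lists, skipping any
    header row. Column layout (time, frequency[, confidence]) is an
    assumption based on CREPE's output convention."""
    times, freqs = [], []
    for row in csv.reader(fileobj):
        if not row or not row[0].replace('.', '', 1).isdigit():
            continue  # skip header or blank lines
        times.append(float(row[0]))
        freqs.append(float(row[1]))
    return times, freqs

# Illustrative file contents, not actual FCN-f0 output
sample = "time,frequency,confidence\n0.000,220.5,0.91\n0.010,221.0,0.93\n"
t, f = load_f0_csv(io.StringIO(sample))
print(t, f)
```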

Run the analysis on a whole folder of audio files :

    python /path_to/FCN-f0/FCN-f0.py /path_to/audio_files

Specify an output directory or file name with the "-o" option (if the directory doesn't exist, it will be created):

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -o /path_to/output.f0.csv
    python /path_to/FCN-f0/FCN-f0.py /path_to/audio_files -o /path_to/output_dir

Choose a specific model for running the analysis (default is FCN-993):

Use the FCN-929 model:

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 929 -o /path_to/output.f0-929.csv

Use the FCN-993 model:

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 993 -o /path_to/output.f0-993.csv

Use the FCN-1953 model:

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m 1953 -o /path_to/output.f0-1953.csv

Use the CREPE-speech model:

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -m CREPE -o /path_to/output.f0-CREPE.csv

Apply Viterbi smoothing to the output:

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -vit
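Viterbi smoothing decodes the most likely pitch trajectory through the network's per-frame activation matrix, trading per-frame likelihood against a transition prior that discourages large jumps. A minimal sketch of the idea; the Gaussian transition prior and its width here are illustrative choices, not the values used by the `-vit` option:

```python
import numpy as np

def viterbi_smooth(salience, transition_width=5.0):
    """Viterbi decoding over a (frames x bins) salience matrix.
    Returns the index of the chosen bin per frame. The transition
    score penalizes bin jumps with a Gaussian prior (assumed here)."""
    n_frames, n_bins = salience.shape
    bins = np.arange(n_bins)
    # log transition matrix: prefer staying near the previous bin
    trans = -0.5 * ((bins[:, None] - bins[None, :]) / transition_width) ** 2
    log_sal = np.log(salience + 1e-12)
    score = log_sal[0].copy()
    back = np.zeros((n_frames, n_bins), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans          # prev bin -> current bin
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], bins] + log_sal[t]
    # backtrack from the best final bin
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# A spurious jump in the middle frame is smoothed away when the
# transition prior is narrow (strongly penalizing jumps).
sal = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.1, 0.8],
                [0.9, 0.1, 0.0]])
print(viterbi_smooth(sal, transition_width=0.5))
```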

Output the result in SDIF format (requires installing the eaSDIF Python library; the default format is CSV):

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -f sdif

Deactivate fully-convolutional mode (for comparison purposes; not recommended otherwise, as it makes the computation much slower):

    python /path_to/FCN-f0/FCN-f0.py /path_to/test.wav -FC 0
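The speed difference comes from a basic property of convolutions: one pass over the whole signal yields the same outputs as re-running the network on every overlapping window, without recomputing the overlaps. A toy demonstration with a single convolution layer:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)
kernel = rng.standard_normal(8)

# Frame-by-frame: extract each 8-sample window and convolve it
# separately — this redoes most of the work for every frame.
frame_out = [np.convolve(signal[i:i + 8], kernel, mode='valid')[0]
             for i in range(len(signal) - 7)]

# Fully-convolutional: a single pass over the whole signal.
full_out = np.convolve(signal, kernel, mode='valid')

print(np.allclose(frame_out, full_out))
```

The same equivalence holds layer by layer through a fully-convolutional network, which is why the `-FC 0` mode exists only for comparison.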

References

[1] Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello. "CREPE: A Convolutional Representation for Pitch Estimation", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.

[2] J. L. Gauvain, L. F. Lamel, and M. Eskenazi, "Design Considerations and Text Selection for BREF, a Large French Read-Speech Corpus," Proc. 1st International Conference on Spoken Language Processing (ICSLP), pp. 1097–1100, 1990. http://www.limsi.fr/~lamel/kobe90.pdf

[3] V. Zue, S. Seneff, and J. Glass, "Speech Database Development at MIT: TIMIT and Beyond," Speech Communication, vol. 9, pp. 351–356, 1990.

[4] L. Ardaillon, “Synthesis and expressive transformation of singing voice,” Ph.D. dissertation, EDITE; UPMC-Paris 6 Sorbonne Universités, 2017.
