
Charsiu: A transformer-based phonetic aligner [arXiv]

Updates

  • 2.10.2022. We release phone- and word-level alignments for 860k utterances from the English subset of Common Voice. Check out this link.
  • 1.31.2022. We release phone- and word-level alignments for over a million Mandarin utterances. Check out this link.
  • 1.26.2022. Word alignment functionality has been added to charsiu_forced_aligner.

Intro

Charsiu is a phonetic alignment tool that can:

  • recognise phonemes in a given audio file;
  • perform forced alignment using phone transcriptions created in the previous step or provided by the user;
  • directly predict the phone-to-audio alignment from audio alone (text-independent alignment).

The aligner is under active development. New functions, new languages and detailed documentation will be added soon! Give us a star if you like our project!
Fun fact: Char Siu is one of the most representative dishes of Cantonese cuisine 🍲 (see wiki).


Tutorial

New! A step-by-step tutorial for linguists: Open In Colab

You can directly run our model in the cloud via Google Colab!

  • Forced alignment: Open In Colab
  • Textless alignment: Open In Colab

Usage

git clone https://github.com/lingjzhu/charsiu
cd charsiu

Forced alignment

from Charsiu import charsiu_forced_aligner
# If importing fails, uncomment the following lines and add the path to the charsiu source directory
# import sys
# sys.path.append('path_to_charsiu/src')

# initialize model
charsiu = charsiu_forced_aligner(aligner='charsiu/en_w2v2_fc_10ms')
# perform forced alignment
alignment = charsiu.align(audio='./local/SA1.WAV',
                          text='She had your dark suit in greasy wash water all year.')
# perform forced alignment and save the output as a textgrid file
charsiu.serve(audio='./local/SA1.WAV',
              text='She had your dark suit in greasy wash water all year.',
              save_to='./local/SA1.TextGrid')


# Chinese
charsiu = charsiu_forced_aligner(aligner='charsiu/zh_w2v2_tiny_fc_10ms', lang='zh')
charsiu.align(audio='./local/SSB00050015_16k.wav', text='经广州日报报道后成为了社会热点。')
charsiu.serve(audio='./local/SSB00050015_16k.wav', text='经广州日报报道后成为了社会热点。',
              save_to='./local/SSB00050015.TextGrid')
              
# A NumPy array containing the speech signal can also be passed to the model.
import soundfile as sf
y, sr = sf.read('./local/SSB00050015_16k.wav')
charsiu.align(audio=y, text='经广州日报报道后成为了社会热点。')
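
As a quick sanity check, you can print the alignment that align returns. A minimal sketch, assuming (based on the output shown in the issues below) that the forced aligner returns phone- and word-level alignments as sequences of (start, end, label) entries with times in seconds; check the current source if your version differs:

phones, words = charsiu.align(audio='./local/SSB00050015_16k.wav',
                              text='经广州日报报道后成为了社会热点。')
# Print one word per line: start time, end time, word label
for start, end, word in words:
    print(f'{start:.2f}\t{end:.2f}\t{word}')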

Textless alignment

from Charsiu import charsiu_predictive_aligner
# English
# initialize a model
charsiu = charsiu_predictive_aligner(aligner='charsiu/en_w2v2_fc_10ms')
# perform textless alignment
alignment = charsiu.align(audio='./local/SA1.WAV')
# Or
# perform textless alignment and output the results to a textgrid file
charsiu.serve(audio='./local/SA1.WAV', save_to='./local/SA1.TextGrid')


# Chinese
charsiu = charsiu_predictive_aligner(aligner='charsiu/zh_xlsr_fc_10ms', lang='zh')

charsiu.align(audio='./local/SSB16240001_16k.wav')
# Or
charsiu.serve(audio='./local/SSB16240001_16k.wav', save_to='./local/SSB16240001.TextGrid')
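
The pretrained models appear to expect 16 kHz input (note the _16k suffix in the example filenames above). If your recordings use a different sampling rate, you can resample them on load, for instance with librosa. A minimal sketch, with a hypothetical file path:

import librosa

# librosa resamples to the requested rate while loading the file.
y, sr = librosa.load('./local/my_recording.wav', sr=16000)
alignment = charsiu.align(audio=y)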

Pretrained models

Pretrained models are available at the 🤗 HuggingFace model hub: https://huggingface.co/charsiu.
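
The aligner classes fetch these weights automatically on first use. If you want to cache a model ahead of time (e.g. for a machine without internet access at run time), a minimal sketch using huggingface_hub:

from huggingface_hub import snapshot_download

# Download the full model repository into the local HuggingFace cache.
local_dir = snapshot_download('charsiu/en_w2v2_fc_10ms')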

Development plan

  • Package

    Item                Progress
    ----                --------
    Documentation       Nov 2021
    TextGrid support    done
    Word segmentation   done
    Model compression   TBD
    IPA support         TBD

  • Multilingual support

    Language            Progress
    --------            --------
    English (American)  done
    Mandarin Chinese    done
    German              TBD
    Spanish             TBD
    English (British)   TBD
    Cantonese           TBD
    AAVE                TBD

Dependencies

pytorch
transformers
datasets
librosa
g2p_en
praatio
g2pM
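
A minimal sketch of installing these with pip; the PyPI package names are assumptions (in particular, pytorch installs as torch):

pip install torch transformers datasets librosa g2p_en praatio g2pM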

Training

The training pipeline is coming soon!

Note: the training code is in experiments/. It is the original research code used to train the models and still needs to be reorganized.

Attribution and Citation

For now, you can cite this tool as:

@inproceedings{zhu2022charsiu,
  title={Phone-to-audio alignment without text: A Semi-supervised Approach},
  author={Zhu, Jian and Zhang, Cong and Jurgens, David},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022}
}

Or share a direct web link: https://github.com/lingjzhu/charsiu/.

References

Transformers
s3prl
Montreal Forced Aligner

Disclaimer

This tool is a beta version and is still under active development. It may have bugs and quirks, alongside the difficulties and provisos described throughout the documentation. This tool is distributed under the MIT license; please see the license file for details.

By using this tool, you acknowledge:

  • That you understand that this tool does not produce perfect camera-ready data, and that all results should be hand-checked for sanity's sake, or at the very least, noise should be taken into account.

  • That you understand that this tool is a work in progress which may contain bugs. Future versions will be released, and bug fixes (and additions) will not necessarily be advertised.

  • That this tool may break with future updates of the various dependencies, and that the authors are not required to repair the package when that happens.

  • That you understand that the authors are not required or necessarily available to fix bugs which are encountered (although you're welcome to submit bug reports to Jian Zhu ([email protected]), if needed), nor to modify the tool to your needs.

  • That you will acknowledge the authors of the tool if you use, modify, fork, or re-use the code in your future work.

  • That rather than re-distributing this tool to other researchers, you will instead advise them to download the latest version from the website.

... and, most importantly:

  • That neither the authors, our collaborators, nor the University of Michigan or any related universities as a whole, are responsible for the results obtained from the proper or improper usage of the tool, and that the tool is provided as-is, as a service to our fellow linguists.

All that said, thanks for using our tool, and we hope it works wonderfully for you!

Support or Contact

Please contact Jian Zhu ([email protected]) for technical support.
Contact Cong Zhang ([email protected]) if you would like to receive more instructions on how to use the package.


charsiu's Issues

Possible fixes from fork

Thanks for this great package! I forked the repo to tweak a few things for my use case, and some of them might be useful to merge back into the master branch. I haven't submitted a PR because some of them might not be appropriate or desirable to merge, so I figured you could tell me which ones you want; I can then clean up the code, add tests if necessary, and submit a PR.

Fork is at https://github.com/nmfisher/charsiu

Changes are:

  1. Don't require the sampling rate to be explicitly provided, since librosa can resample to 16,000 Hz when loading a file.
  2. Reinstate punctuation and insert the punctuation token, rather than silence, into the phone list.
  3. Downweight silence to minimize erroneous insertion of silence in the middle of a word (this should probably be a parameter rather than a hardcoded 0.1).
  4. Ignore silence where the left and right phones are identical (to avoid inserting silence frames in the middle of consecutive frames for a single phone). This works for me right now but needs more thought: if phones are intentionally repeated (e.g. "ai ai"), this will fold the silence between them into the left phone, so "ai [SIL] ai" will always become "ai ai". The solution is probably to pass a minimum-silence-duration parameter (if the silence is longer than X, it's preserved; otherwise it's folded into the left phone).

wav2vec2-fs model for Chinese alignment?

Hello, from your paper it seems that W2V2-FS's alignment is better than W2V2-FC's, but currently only an English W2V2-FS model is available. Have you tested W2V2-FS alignment on Chinese? If not, I'd like to train a model to test it.
I would also like to know the specific steps for training. I have read your training code, but I don't know which dataset I should use.
I'd like to use W2V2-FS to replace MFA.
(Sorry for my bad English.)

Bug in phoneme to word conversion -- duplicate words

Something seems to be wrong with how [SIL] is handled in the word transcriptions.

This is the first example in the LibriSpeech test set.

Here is the true transcript:

HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT

Here is the force-aligned word transcript:

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.33', 'he'],
       ['0.33', '0.65', 'began'],
       ['0.65', '0.69', 'a'],
       ['0.69', '1.21', 'confused'],
       ['1.21', '1.62', 'complaint'],
       ['1.62', '1.93', 'against'],
       ['1.93', '2.01', 'the'],
       ['2.01', '2.41', 'wizard'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'wizard'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.75', 'who'],
       ['2.75', '2.84', 'had'],
       ['2.84', '3.26', 'vanished'],
       ['3.26', '3.59', 'behind'],
       ['3.59', '3.66', 'the'],
       ['3.66', '4.02', 'curtain'],
       ['4.02', '4.15', 'on'],
       ['4.15', '4.23', 'the'],
       ['4.23', '4.66', 'left'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

Here is the force-aligned phonetic transcript:

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.3', 'HH'],
       ['0.3', '0.33', 'IY'],
       ['0.33', '0.39', 'B'],
       ['0.39', '0.44', 'IH'],
       ['0.44', '0.53', 'G'],
       ['0.53', '0.6', 'AE'],
       ['0.6', '0.65', 'N'],
       ['0.65', '0.69', 'AH'],
       ['0.69', '0.77', 'K'],
       ['0.77', '0.81', 'AH'],
       ['0.81', '0.86', 'N'],
       ['0.86', '0.97', 'F'],
       ['0.97', '1.02', 'Y'],
       ['1.02', '1.1', 'UW'],
       ['1.1', '1.16', 'Z'],
       ['1.16', '1.21', 'D'],
       ['1.21', '1.26', 'K'],
       ['1.26', '1.3', 'AH'],
       ['1.3', '1.34', 'M'],
       ['1.34', '1.44', 'P'],
       ['1.44', '1.49', 'L'],
       ['1.49', '1.55', 'EY'],
       ['1.55', '1.58', 'N'],
       ['1.58', '1.62', 'T'],
       ['1.62', '1.66', 'AH'],
       ['1.66', '1.74', 'G'],
       ['1.74', '1.78', 'EH'],
       ['1.78', '1.84', 'N'],
       ['1.84', '1.9', 'S'],
       ['1.9', '1.93', 'T'],
       ['1.93', '1.96', 'DH'],
       ['1.96', '2.01', 'AH'],
       ['2.01', '2.1', 'W'],
       ['2.1', '2.15', 'IH'],
       ['2.15', '2.26', 'Z'],
       ['2.26', '2.34', 'ER'],
       ['2.34', '2.41', 'D'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'D'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.7', 'HH'],
       ['2.7', '2.75', 'UW'],
       ['2.75', '2.78', 'HH'],
       ['2.78', '2.8', 'AE'],
       ['2.8', '2.84', 'D'],
       ['2.84', '2.95', 'V'],
       ['2.95', '3.04', 'AE'],
       ['3.04', '3.09', 'N'],
       ['3.09', '3.15', 'IH'],
       ['3.15', '3.23', 'SH'],
       ['3.23', '3.26', 'T'],
       ['3.26', '3.3', 'B'],
       ['3.3', '3.35', 'IH'],
       ['3.35', '3.43', 'HH'],
       ['3.43', '3.53', 'AY'],
       ['3.53', '3.56', 'N'],
       ['3.56', '3.59', 'D'],
       ['3.59', '3.62', 'DH'],
       ['3.62', '3.66', 'AH'],
       ['3.66', '3.78', 'K'],
       ['3.78', '3.9', 'ER'],
       ['3.9', '3.93', 'T'],
       ['3.93', '3.96', 'AH'],
       ['3.96', '4.02', 'N'],
       ['4.02', '4.09', 'AA'],
       ['4.09', '4.15', 'N'],
       ['4.15', '4.19', 'DH'],
       ['4.19', '4.23', 'AH'],
       ['4.23', '4.36', 'L'],
       ['4.36', '4.47', 'EH'],
       ['4.47', '4.58', 'F'],
       ['4.58', '4.66', 'T'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

I suspect this may indicate a general problem with the phoneme to word conversion.

Scripts to reproduce results from paper?

Would you have the scripts to reproduce the results from the paper (I'm particularly interested in Table 2), or perhaps the procedure for reproducing them from this repo?

Reproducing Results for W2V2-FS-20ms

Based on the paper, I've successfully reproduced results for Charsiu's FC-10ms, textless FC-10ms, MFA, and WebMaus, but I'm having trouble reproducing the pretrained FS-20ms model.
I first downloaded charsiu/en_w2v2_fs_10ms from HuggingFace into my working directory.
Then I followed the tutorial for generating alignments. When I try

charsiu = charsiu_forced_aligner(aligner='charsiu/en_w2v2_fs_10ms')

the results are complete gibberish, nowhere near the paper's results.

When I try

charsiu = charsiu_attention_aligner(aligner='charsiu/en_w2v2_fs_10ms')

the results are slightly better, but still not as good as the paper's.

My questions are:

  1. Which one of the above lines should I be using when calling the fs_10ms aligner?
  2. Is there perhaps a step I'm missing after downloading the model from HuggingFace?

Thanks so much!

What each aligner does

I keep forgetting which aligner does what, so here it is:

  • Wav2Vec2ForFrameClassification is just w2v2 with a linear-layer head.
  • charsiu_predictive_aligner takes the argmax of the logits of the Wav2Vec2ForFrameClassification model.
  • charsiu_forced_aligner runs g2p on a given transcript, then uses that sequence of phones to index into the logits of Wav2Vec2ForFrameClassification along the phone-id axis. DTW on the resulting phone-by-time cost matrix then finds the alignment.

    (charsiu/src/utils.py, lines 304 to 305 at commit 13a69f2)

    D, align = dtw(C=-cost[:, phone_ids],
                   step_sizes_sigma=np.array([[1, 1], [1, 0]]))
  • charsiu_attention_aligner uses Wav2Vec2ForAttentionAlignment, which uses w2v2 to encode speech and a BERT to encode phonemes, and then something really over-engineered. The DTW is the correct way to normalize the output of w2v2, and it seems Wav2Vec2ForAttentionAlignment only exists because DTW was overlooked. Should this be deprecated?
  • charsiu_chain_forced_aligner runs w2v2-c2c to get phonemes, then Wav2Vec2ForAttentionAlignment followed by DTW. Perhaps this should be replaced by charsiu_forced_aligner with the phonemes obtained from w2v2-c2c.

How do we use the pretrained attention aligner?

Hi, I find that getting a pretrained predictive aligner (aligner='charsiu/en_w2v2_fc_10ms') to work with LibriSpeech is straightforward. However, I'm unable to get the attention aligner working. How do I initialize the aligner, and how do I get the corresponding BERT config to go with it? It keeps throwing the same error.

Can't currently support long audio?

"Charsiu works best when your files are shorter than 15 s. Test whether your files are longer than 15 s."

I saw this hint in the description and tested it.

When forcing alignment on long audio, the following error appears:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    charsiu.align(audio=audio, text=text)
  File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
    pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
  File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
    word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
IndexError: list index out of range

About training

The README says the code in experiments/ was the original research code for training the model.
Good job! I want to pre-train on my own Chinese dataset, but I don't know whether the code in experiments/ is ready to use. If it is, could you write up rough training instructions?
