
ctc-segmentation's Introduction

CTC segmentation


CTC segmentation can be used to find utterance alignments within large audio files.

  • This repository contains the ctc-segmentation python package.
  • A description of the algorithm is in the CTC segmentation paper (on Springer Link, on ArXiv)

Usage

The CTC segmentation package is not standalone, as it needs a neural network with CTC output. It is integrated in these frameworks:

  • ESPnet 1 and ESPnet 2 (see the alignment scripts asr_align.py)
  • NeMo (dataset creation tool based on CTC segmentation)

It can also be used with other frameworks:

Wav2vec2 example code
import torch
import numpy as np
from typing import List
import ctc_segmentation
from datasets import load_dataset
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer

# load model, processor and tokenizer
model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
processor = Wav2Vec2Processor.from_pretrained(model_name)
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# load dummy dataset and read soundfiles
SAMPLERATE = 16000
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]
transcripts = ["A MAN SAID TO THE UNIVERSE", "SIR I EXIST"]

def align_with_transcript(
    audio : np.ndarray,
    transcripts : List[str],
    samplerate : int = SAMPLERATE,
    model : Wav2Vec2ForCTC = model,
    processor : Wav2Vec2Processor = processor,
    tokenizer : Wav2Vec2CTCTokenizer = tokenizer
):
    assert audio.ndim == 1
    # Run prediction, get logits and probabilities
    inputs = processor(audio, return_tensors="pt", padding="longest")
    with torch.no_grad():
        logits = model(inputs.input_values).logits.cpu()[0]
        probs = torch.nn.functional.softmax(logits,dim=-1)
    
    # Tokenize transcripts
    vocab = tokenizer.get_vocab()
    inv_vocab = {v:k for k,v in vocab.items()}
    unk_id = vocab["<unk>"]
    
    tokens = []
    for transcript in transcripts:
        assert len(transcript) > 0
        tok_ids = tokenizer(transcript.replace("\n"," ").lower())['input_ids']
        tok_ids = np.array(tok_ids, dtype=np.int64)  # np.int was removed in NumPy 1.24
        tokens.append(tok_ids[tok_ids != unk_id])
    
    # Align
    char_list = [inv_vocab[i] for i in range(len(inv_vocab))]
    config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list)
    config.index_duration = audio.shape[0] / probs.size()[0] / samplerate
    
    ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_token_list(config, tokens)
    timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, probs.numpy(), ground_truth_mat)
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcripts)
    return [{"text" : t, "start" : p[0], "end" : p[1], "conf" : p[2]} for t,p in zip(transcripts, segments)]
    
def get_word_timestamps(
    audio : np.ndarray,
    samplerate : int = SAMPLERATE,
    model : Wav2Vec2ForCTC = model,
    processor : Wav2Vec2Processor = processor,
    tokenizer : Wav2Vec2CTCTokenizer = tokenizer
):
    assert audio.ndim == 1
    # Run prediction, get logits and probabilities
    inputs = processor(audio, return_tensors="pt", padding="longest")
    with torch.no_grad():
        logits = model(inputs.input_values).logits.cpu()[0]
        probs = torch.nn.functional.softmax(logits,dim=-1)
        
    predicted_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.decode(predicted_ids)
    
    # Split the transcription into words
    words = pred_transcript.split(" ")
    
    # Align
    vocab = tokenizer.get_vocab()
    inv_vocab = {v:k for k,v in vocab.items()}
    char_list = [inv_vocab[i] for i in range(len(inv_vocab))]
    config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list)
    config.index_duration = audio.shape[0] / probs.size()[0] / samplerate
    
    ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, words)
    timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, probs.numpy(), ground_truth_mat)
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, words)
    return [{"text" : w, "start" : p[0], "end" : p[1], "conf" : p[2]} for w,p in zip(words, segments)]

print(align_with_transcript(audio,transcripts))
# [{'text': 'A MAN SAID TO THE UNIVERSE', 'start': 0.08124999999999993, 'end': 2.034375, 'conf': 0.0}, 
#  {'text': 'SIR I EXIST', 'start': 2.3260775862068965, 'end': 4.078771551724138, 'conf': 0.0}]

print(get_word_timestamps(audio))
# [{'text': 'a', 'start': 0.08124999999999993, 'end': 0.5912715517241378, 'conf': 0.9999501323699951}, 
# {'text': 'man', 'start': 0.5912715517241378, 'end': 0.9219827586206896, 'conf': 0.9409108982174931}, 
# {'text': 'said', 'start': 0.9219827586206896, 'end': 1.2326508620689656, 'conf': 0.7700278702302796}, 
# {'text': 'to', 'start': 1.2326508620689656, 'end': 1.3529094827586206, 'conf': 0.5094435178226225}, 
# {'text': 'the', 'start': 1.3529094827586206, 'end': 1.4831896551724135, 'conf': 0.4580493446392211}, 
# {'text': 'universe', 'start': 1.4831896551724135, 'end': 2.034375, 'conf': 0.9285054256219009}, 
# {'text': 'sir', 'start': 2.3260775862068965, 'end': 3.036530172413793, 'conf': 0.0}, 
# {'text': 'i', 'start': 3.036530172413793, 'end': 3.347198275862069, 'conf': 0.7995760873559864}, 
# {'text': 'exist', 'start': 3.347198275862069, 'end': 4.078771551724138, 'conf': 0.0}]

Installation

  • With pip:
pip install ctc-segmentation
  • From the Arch Linux AUR as python-ctc-segmentation-git using your favourite AUR helper.

  • From source:

git clone https://github.com/lumaku/ctc-segmentation
cd ctc-segmentation
cythonize -3 ctc_segmentation/ctc_segmentation_dyn.pyx
python setup.py build
python setup.py install --optimize=1 --skip-build

How it works

1. Forward propagation

Character probabilities from each time step are obtained from a CTC-based network. With these, transition probabilities are mapped into a trellis diagram. To account for preambles or unrelated segments in audio files, the transition cost is set to zero for the start-of-sentence or blank token.

Forward trellis
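To illustrate the idea, here is a minimal numpy sketch of such a trellis. It is not the package's optimized Cython implementation and it simplifies the transition model to two moves (stay in the current position via the blank, or switch to the next ground truth token); lpz denotes CTC log posteriors, as elsewhere in this README:

import numpy as np

def forward_trellis(lpz, gt, blank=0):
    """lpz: (T, vocab) CTC log posteriors; gt: (S,) token ids, gt[0] is the start state."""
    T, S = lpz.shape[0], gt.shape[0]
    trellis = np.full((T, S), -np.inf)
    trellis[:, 0] = 0.0  # zero transition cost in the start state: preambles are free
    for t in range(1, T):
        for s in range(1, S):
            stay = trellis[t - 1, s] + lpz[t, blank]        # remain, emit blank
            switch = trellis[t - 1, s - 1] + lpz[t, gt[s]]  # advance to the next token
            trellis[t, s] = max(stay, switch)
    return trellis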

2. Backtracking

Starting from the time step with the highest probability for the last character, backtracking determines the most probable path of characters through all time steps.

Backward path
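Continuing the sketch from above, a matching backtracking pass could look like this (again only an illustration of the principle, not the library code):

def backtrack(trellis, lpz, gt, blank=0):
    """Recover the most probable (time step, token position) path."""
    T, S = trellis.shape
    t = int(np.argmax(trellis[:, S - 1]))  # best time step for the last token
    s = S - 1
    path = [(t, s)]
    while t > 0 and s > 0:
        # re-evaluate both transitions to find the predecessor chosen by the forward pass
        stay = trellis[t - 1, s] + lpz[t, blank]
        switch = trellis[t - 1, s - 1] + lpz[t, gt[s]]
        if switch >= stay:
            s -= 1
        t -= 1
        path.append((t, s))
    return list(reversed(path))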

3. Confidence score

As this method generates a probability for each aligned character, a confidence score for each utterance can be derived. For example, if a word within an utterance is missing, this value is low.

Confidence score

The confidence score helps to detect and filter out bad utterances.
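For example, with the dictionaries returned by align_with_transcript from the wav2vec2 example above, low-confidence utterances could be dropped like this (the threshold is a placeholder and depends on the score space; see also "Segments clean-up" below):

min_confidence = 0.5  # placeholder value; tune on your data
segments = align_with_transcript(audio, transcripts)
good = [seg for seg in segments if seg["conf"] > min_confidence]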

Parameters

Several notable parameters adjust the behavior of the algorithm; they can be found in the class CtcSegmentationParameters:

Data preparation parameters

  • Localization: The character set is taken from the model dictionary, i.e., it is usually generated with SentencePiece. An ASR model trained in the corresponding language and character set is needed. For Asian languages, no changes to the CTC segmentation parameters should be necessary. One exception: if the character set contains any punctuation characters, "#", or the Greek character "ε", adapt the settings in an instance of CtcSegmentationParameters in segmentation.py.

  • CtcSegmentationParameters includes a blank character. If the model dictionary uses, e.g., "<blank>" instead of the default "_", copy the blank character from the dictionary over to the configuration. If the blank in the configuration and in the dictionary mismatch, the algorithm raises an IndexError at backtracking.

  • If replace_spaces_with_blanks is True, spaces in the ground truth sequence are replaced with blanks. This option is enabled by default and improves compatibility with dictionaries that lack a space character (see the sketch after this list).
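A hedged configuration sketch for the points in this list, with char_list taken from the model dictionary as in the wav2vec2 example above (the blank handling is an assumption based on the common CTC convention that the blank is the first dictionary entry; verify against your version of the package):

import ctc_segmentation

config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list)
# Assumption: the blank symbol is the first char_list entry, e.g. "<blank>"
# for wav2vec2-style vocabularies; a dictionary/configuration mismatch
# surfaces as an IndexError during backtracking.
config.replace_spaces_with_blanks = True  # default, see above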

Alignment parameters

  • min_window_size: Minimum window size considered for a single utterance. The current default value should be OK in most cases.

  • To align utterances with longer unknown audio sections between them, use blank_transition_cost_zero (default: False). With this option, the stay transition in the blank state is free. A transition to the next character is only consumed if the probability to switch is higher. In this way, more time steps can be skipped between utterances. Caution: in combination with replace_spaces_with_blanks == True, this may lead to misaligned segments (an illustrative configuration follows this list).
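An illustrative configuration of these parameters, reusing the config object from the sketch above (values are examples, not recommendations):

config.min_window_size = 8000                # default window size, roughly 250 s of audio
config.blank_transition_cost_zero = True     # free stay transition in the blank state
config.replace_spaces_with_blanks = False    # avoid the interaction warned about above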

Time stamp parameters

Directly set the parameter index_duration to give the corresponding time duration of one CTC output index (in seconds).

Example: Take a sample rate of, say, 16 kHz, so fs=16000. How many sample points correspond to one CTC output index? In some ASR systems, this can be calculated as the hop length of the frontend windowing times the encoder subsampling factor. For example, if the hop length of the frontend windowing is 128 and the subsampling factor in the encoder is 4, one CTC index totals 512 sample points, and index_duration = 512 / 16000.

Note: In earlier versions, index_duration was not used and the time stamps were determined from the values of subsampling_factor and frame_duration_ms. To derive index_duration from these values, calculate frame_duration_ms * subsampling_factor / 1000.
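Both calculations as a short worked example (all numbers come from the paragraphs above):

fs = 16000                  # sample rate in Hz
hop_length = 128            # hop length of the frontend windowing, in samples
subsampling_factor = 4      # encoder subsampling factor

# Current parameter: seconds of audio per CTC output index
config.index_duration = hop_length * subsampling_factor / fs  # 512 / 16000 = 0.032

# Equivalent derivation from the legacy parameters
frame_duration_ms = 1000 * hop_length / fs                    # 8 ms per frame
assert frame_duration_ms * subsampling_factor / 1000 == config.index_duration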

Confidence score parameters

Character probabilities over each L frames are accumulated to calculate the confidence score. The L value can be adapted with score_min_mean_over_L. A lower L makes the score more sensitive to errors in the transcription, but also to errors of the ASR model.
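For example (the value is only an illustration; check the default of your installed version):

# A lower L makes the confidence score react to shorter mismatches,
# at the price of also penalizing short local errors of the ASR model.
config.score_min_mean_over_L = 20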

Toolkit Integration

CTC segmentation requires CTC activations of an already trained CTC-based network. Example code can be found in the alignment scripts asr_align.py of ESPnet 1 or ESPnet 2.

Steps to alignment for regular ASR

  1. First, the ground truth text needs to be converted into a matrix: Use prepare_token_list on a ground truth sequence that was already converted to a sequence of tokens (recommended). Alternatively, use prepare_text on raw text; this method filters characters not in the dictionary and can break longer tokens into smaller tokens.
  2. ctc_segmentation computes character-wise alignments from the CTC log posterior probabilities.
  3. determine_utterance_segments converts character-wise alignments to utterance-wise alignments.
  4. In a post-processing step, segments may be filtered by their confidence value. A compact sketch of these steps follows this list.
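The sketch assumes that token_list (tokenized ground truth), utterances (the corresponding strings), lpz (CTC log posteriors) and a prepared config already exist, as in the wav2vec2 example above; the threshold in step 4 is a placeholder:

import ctc_segmentation

# 1. ground truth -> matrix
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_token_list(config, token_list)
# 2. character-wise alignment
timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
# 3. utterance-wise segments: one (start, end, confidence) tuple per utterance
segments = ctc_segmentation.determine_utterance_segments(
    config, utt_begin_indices, char_probs, timings, utterances
)
# 4. post-processing: keep only confident segments (placeholder threshold)
aligned = [(u, s) for u, s in zip(utterances, segments) if s[2] > threshold]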

Steps to alignment for different use-cases

Sometimes the ground truth data is not text, but a sequence of tokens, or a list of protein segments. In this case, use either prepare_token_list or replace it with a function better suited to your data. For examples, see the prepare_* functions in ctc_segmentation.py, or the example included in the NeMo toolkit.

Segments clean-up

Segments that were written to a segments file can be filtered using the confidence score. This is the minimum confidence score in log space as described in the paper.

Utterances with a low confidence score are discarded in a data clean-up. This parameter may need adjustment depending on the dataset, the ASR model and the text conversion used.

min_confidence_score=1.5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${unfiltered} > ${filtered}

FAQ

  • How do I split a large text into multiple utterances? This can be done automatically; e.g., in our paper, the text of the book chapter was split into utterances at sentence endings.

  • What if there are unrelated parts within the audio file without transcription? Unrelated segments can be skipped with the gratis_blank parameter. Larger unrelated segments may deteriorate the results; try increasing the minimum window size. Partially repeated segments have a high chance of disarranging the alignment; remove them if possible. Such segments can be detected with the confidence score. Use the state_list to see how well the unrelated part was "ignored".

  • How fast is CTC segmentation? On a modern computer (64 GB RAM, Nvidia RTX 2080 Ti, AMD Ryzen 7 2700X), it takes around 400 ms to align 500 s of audio with tokenized text and the default window size (8000 indices, roughly 250 s). In comparison, inference of CTC posteriors takes between 4 s and 20 s on CPU; GPU inference takes roughly 500-1000 ms, but often fails on such long audio files because of excessive memory consumption (tested with a Transformer model on ESPnet 2; this strongly depends on the model architecture and the toolkit used). Several factors influence CTC segmentation speed: window size, length of audio, length of text, how well the text fits the audio, and the preprocessing function. Aligning from tokenized text is faster, because alignment with prepare_text additionally includes transition probabilities from partial tokens; this increases the complexity by the length of the longest token in the character dictionary (including the blank token).

  • The inference of the model is too slow and uses too much memory. Use RNN-based ASR networks, whose complexity grows linearly with longer audio files; for Transformer-based architectures, inference complexity on long audio files increases quadratically. Alternatively, it is possible to partition the audio into several parts. CTC segmentation includes an example partitioning function in ctc_segmentation.get_partitions; example code can be found in JtubeSpeech.

  • How can I improve the alignment speed of CTC segmentation? The alignment algorithm is not parallelizable for batch processing, so use a CPU with good single-thread performance. It is possible to align multiple files in parallel if the computer has enough temporary memory. If text is aligned, the alignment is faster with a shorter maximum token length - or align directly from a token list.

  • How do I get word-based alignments instead of full utterance segments? Use an ASR model with character tokens to improve the time resolution. Then handle each word as a single utterance.

  • How can I improve the accuracy of the generated alignments? Be aware that, depending on the ASR performance of the network and other factors, CTC activations are not always accurate and are sometimes shifted by a few frames. To get a better time resolution, use a dictionary with characters! Also, the prepare_text function tries to break down long tokens into smaller tokens.

  • What is the difference between prepare_token_list and prepare_text? Explained in examples:

Example for `prepare_token_list`

Let's say we have a text text = ["cat"] and a dictionary that includes the word cat as well as its parts: char_list = ["•", "UNK", "a", "c", "t", "cat"]. A "tokenize" method (e.g., the model's preprocess_fn) will produce:

text = ["cat"]
char_list = ["•", "UNK", "a", "c", "t", "cat"]
token_list = [tokenize(utt) for utt in text]
token_list
# [array([5])]
ground_truth_mat, utt_begin_indices = prepare_token_list(config, token_list)
ground_truth_mat
# array([[-1],
#        [ 0],
#        [ 5],
#        [ 0]])
Example for `prepare_text`
Toy example:
text = ["cat"]
char_list = ["•", "UNK", "a", "c", "t", "cat"]
ground_truth_mat, utt_begin_indices = prepare_text(config, text, char_list)
# array([[-1, -1, -1],
#        [ 0, -1, -1],
#        [ 3, -1, -1],
#        [ 2, -1, -1],
#        [ 4, -1,  5],
#        [ 0, -1, -1]])

Here, the partial characters are detected (3, 2, 4), as well as the full "cat" token (5). This is done to achieve a better time resolution for the alignment.

Full example with a bpe 500 model char list from Tedlium 2:

from ctc_segmentation import CtcSegmentationParameters
from ctc_segmentation import prepare_text

char_list = [ "<unk>", "'", "a", "ab", "able", "ace", "ach", "ack",
"act","ad","ag", "age", "ain", "al", "alk", "all", "ally", "am",
"ame", "an","and", "ans", "ant", "ap", "ar", "ard", "are",
"art","as", "ase","ass", "ast", "at", "ate", "ated", "ater", "ation",
"ations", "ause","ay", "b", "ber", "ble", "c", "ce", "cent",
"ces","ch", "ci", "ck","co", "ct", "d", "de", "du", "e", "ear",
"ect", "ed", "een", "el","ell", "em", "en", "ence", "ens",
"ent","enty", "ep", "er", "ere","ers", "es", "ess", "est", "et", "f",
"fe", "ff", "g", "ge", "gh","ght", "h", "her", "hing", "ht",
"i","ia", "ial", "ib", "ic","ical", "ice", "ich", "ict", "id", "ide",
"ie", "ies", "if","iff","ig", "ight", "ign", "il", "ild", "ill","im",
"in", "ind","ine", "ing", "ink", "int", "ion", "ions", "ip",
"ir","ire","is","ish", "ist", "it", "ite", "ith", "itt", "ittle",
"ity", "iv","ive", "ix", "iz", "j", "k", "ke", "king", "l",
"ld","le","ll","ly", "m", "ment", "ms", "n", "nd", "nder", "nt", "o",
"od","ody", "og", "ol", "olog", "om", "ome", "on",
"one","ong","oo","ood", "ook", "op", "or", "ore", "orm", "ort",
"ory", "os","ose", "ot", "other", "ou", "ould", "ound",
"ount","our","ous","ousand", "out", "ow", "own", "p", "ph", "ple",
"pp", "pt","q", "qu", "r", "ra", "rain", "re", "reat", "red",
"ree","res","ro","rou", "rough", "round", "ru", "ry", "s", "se",
"sel", "so","st","t", "ter", "th", "ther", "ty", "u", "ually",
"ud","ue", "ul","ult", "um", "un", "und", "ur", "ure", "us", "use",
"ust", "ut","v","ve", "vel", "ven", "ver", "very", "ves", "ving","w",
"way","x", "y", "z", "ăť", "ō", "▁", "▁a", "▁ab",
"▁about","▁ac","▁act","▁actually", "▁ad", "▁af", "▁ag", "▁al",
"▁all", "▁also","▁am", "▁an", "▁and", "▁any", "▁ar",
"▁are","▁around", "▁as","▁at","▁b", "▁back", "▁be", "▁bec",
"▁because", "▁been", "▁being","▁bet", "▁bl", "▁br", "▁bu",
"▁but","▁by", "▁c", "▁call","▁can","▁ch", "▁chan", "▁cl", "▁co",
"▁com", "▁comm", "▁comp","▁con", "▁cont", "▁could", "▁d",
"▁day","▁de", "▁des","▁did","▁diff", "▁differe", "▁different",
"▁dis", "▁do", "▁does","▁don", "▁down", "▁e", "▁en",
"▁even","▁every", "▁ex", "▁exp","▁f","▁fe", "▁fir", "▁first",
"▁five", "▁for", "▁fr", "▁from", "▁g","▁get", "▁go",
"▁going","▁good", "▁got", "▁h", "▁ha","▁had","▁happ", "▁has",
"▁have", "▁he", "▁her", "▁here", "▁his","▁how", "▁hum",
"▁hundred","▁i", "▁ide", "▁if", "▁im", "▁imp","▁in","▁ind", "▁int",
"▁inter", "▁into", "▁is", "▁it", "▁j", "▁just","▁k","▁kind",
"▁kn","▁know", "▁l", "▁le", "▁let", "▁li", "▁life","▁like",
"▁little", "▁lo", "▁look", "▁lot", "▁m",
"▁ma","▁make","▁man","▁many", "▁may", "▁me", "▁mo", "▁more",
"▁most","▁mu", "▁much", "▁my", "▁n", "▁ne", "▁need", "▁new",
"▁no","▁not","▁now", "▁o", "▁of", "▁on", "▁one","▁only", "▁or",
"▁other", "▁our","▁out", "▁over", "▁p", "▁part", "▁pe",
"▁peop","▁people","▁per","▁ph", "▁pl", "▁po", "▁pr", "▁pre", "▁pro",
"▁put", "▁qu", "▁r","▁re", "▁real", "▁really", "▁res",
"▁right","▁ro", "▁s","▁sa","▁said", "▁say", "▁sc", "▁se", "▁see",
"▁sh", "▁she", "▁show", "▁so","▁som", "▁some", "▁somet","▁something",
"▁sp","▁spe", "▁st","▁start", "▁su", "▁sy", "▁t", "▁ta",
"▁take","▁talk", "▁te","▁th","▁than", "▁that", "▁the", "▁their",
"▁them", "▁then", "▁there","▁these", "▁they", "▁thing",
"▁things","▁think", "▁this","▁those","▁thousand", "▁three",
"▁through", "▁tim", "▁time", "▁to", "▁tr","▁tw", "▁two", "▁u",
"▁un","▁under", "▁up", "▁us","▁v", "▁very","▁w", "▁want", "▁was",
"▁way", "▁we", "▁well", "▁were","▁wh","▁what", "▁when",
"▁where","▁which", "▁who", "▁why", "▁will","▁with", "▁wor", "▁work",
"▁world", "▁would", "▁y","▁year","▁years", "▁you", "▁your"]

text = ["I ▁really ▁like ▁technology",
 "The ▁quick ▁brown ▁fox ▁jumps ▁over ▁the ▁lazy ▁dog.",
 "unknown chars äüößß-!$ "]
config = CtcSegmentationParameters()
config.char_list = char_list
ground_truth_mat, utt_begin_indices = prepare_text(config, text)

# ground_truth_mat
# array([[ -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  0,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [190, 410,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55, 193, 411,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  2,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [137,  13,  -1,  -1, 412,  -1,  -1,  -1,  -1,  -1],
#        [137, 140,  15,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [240, 141,  -1,  16,  -1,  -1, 413,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [137, 356,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 87,  -1, 359,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [134,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55, 135,  -1,  -1, 361,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [209, 438,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55,  -1, 442,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 43,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 83,  47,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [145,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [137, 153,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 79, 152,  -1, 154,  -1,  -1,  -1,  -1,  -1,  -1],
#        [240,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  0,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 83,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [188,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [214, 189, 409,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 87,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 43,  91,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [134,  49,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 40, 266,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [190,  -1, 275,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149, 198,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [237, 181,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [145,  -1, 182,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 76, 311,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [239,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [133, 350,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [214,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [142, 220,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [183,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [204,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149, 386,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [229,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55, 230,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [190,  69, 233,  -1, 395,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [209, 438,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 83, 211, 443,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 55,  -1,  -1, 446,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [137, 356,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  2,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [241,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [240,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [244,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 52, 292,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149,  -1, 301,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 79, 152,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  0,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [214,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [145, 221,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [134,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [145,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [149,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [237, 181,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [145,  -1, 182,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 43,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [ 83,  47,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  2,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [190,  24,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [204,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1],
#        [  0,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1]])

In the example, parts of the word "▁really" were separated into the token ids [244, 410, 411, 412, 413], which corresponds to ['▁', '▁r', '▁re', '▁real', '▁really'].

The CTC segmentation algorithm then iterates over these tokens in the ground truth, calculates the transition probabilities for each token from lpz, and decides for the transition(s) of the token combination with the highest accumulated transition probability.

  • Sometimes the end of the last utterance is cut short. How do I solve this? This is a known issue and strongly depends on the ASR model used. A possible solution is to simply add a few milliseconds to the end of the last utterance. It is also practical to apply a threshold on the mean absolute (MA) signal, as described by Bakhturina et al.

Reference

The full paper can be found as a preprint at https://arxiv.org/abs/2007.09127 or in its published form at https://doi.org/10.1007/978-3-030-60276-5_27. The code used in the paper is archived at https://github.com/cornerfarmer/ctc_segmentation. To cite this work:

@InProceedings{ctcsegmentation,
author="K{\"u}rzinger, Ludwig
and Winkelbauer, Dominik
and Li, Lujun
and Watzel, Tobias
and Rigoll, Gerhard",
editor="Karpov, Alexey
and Potapova, Rodmonga",
title="CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition",
booktitle="Speech and Computer",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="267--278",
abstract="Recent end-to-end Automatic Speech Recognition (ASR) systems demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements in those systems, those models grew in terms of depth, parameters and model capacity. However, these models also require more training data to achieve comparable performance.",
isbn="978-3-030-60276-5"
}

ctc-segmentation's People

Contributors

kamo-naoyuki, kazunaritakeichi, lumaku, piraka9011, vanivan


ctc-segmentation's Issues

Timing squeezed in the beginning

Dear authors,
I tried to use your library to align a true-cased text containing punctuation, but I have a problem with the obtained timings because they all seem squeezed towards the beginning.
I set index_duration to 0.04 since I extract features every 10 ms and have a subsampling factor of 4 at the beginning.
My tokenized textual predictions look like the following:
▁But ▁if ▁you ▁could ▁take ▁a ▁pill ▁ <eol> ▁or ▁a ▁vaccine , ▁ <eob> ▁and ▁just ▁like ▁getting ▁over ▁a ▁cold , ▁ <eob> ▁you ▁could ▁heal ▁your ▁wind ▁faster ? ▁ <eob>
Where <eob> and <eol> are treated as special characters in my vocabulary.
I select <eob> as the split token, i.e., a sentence is split whenever <eob> is found in the text.
The timings obtained are:
0.04-1.52 1.52-2.24 2.24-5.44
And the first thing that is not correct is that the total duration of the segment is 6.75s. I looked at the timings obtained from the ctc segmentation and their last value is 5.44s.
The other thing is that, if I compute the intervals between the timings obtained by your library, I get:
1.48s 0.32s 2.20s
but, if I listen to the audio and measure them, they are approximately:
1.9s 2s 1.7s
Also, if I look at other examples I can observe the same phenomenon: it seems that all the timings are squeezed towards the beginning of the sentence.
I have used both prepare_text and prepare_token_list but it is not the cause of the problem.
Do you have any hint on where the problem is?
Thank you in advance

Align text from wav2vec2

How to use ctc-segmentation with wav2vec2?
The code below works, but the result is not properly aligned.

This wav: meisterfloh.zip
This code:

import torch, transformers, ctc_segmentation
import soundfile

# wav2vec2
model_file = 'facebook/wav2vec2-large-xlsr-53-german'
vocab_dict = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "E": 5, "N": 6, "I": 7, "S": 8, "R": 9, "T": 10, "A": 11, "H": 12, "D": 13, "U": 14, "L": 15, "C": 16, "G": 17, "M": 18, "O": 19, "B": 20, "W": 21, "F": 22, "K": 23, "Z": 24, "V": 25, "Ü": 26, "P": 27, "Ä": 28, "Ö": 29, "J": 30, "Y": 31, "'": 32, "X": 33, "Q": 34, "-": 35}

processor = transformers.Wav2Vec2Processor.from_pretrained( model_file )
model = transformers.Wav2Vec2ForCTC.from_pretrained( model_file )

speech_array, sampling_rate = soundfile.read( 'meisterfloh.wav' )
assert sampling_rate == 16000
features = processor(speech_array,sampling_rate=16000, return_tensors="pt")
input_values = features.input_values
attention_mask = features.attention_mask
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription = transcription.lower()

# ctc-segmentation
config = ctc_segmentation.CtcSegmentationParameters()
with torch.no_grad():
    softmax = torch.nn.Softmax(dim = -1)
    lpz = softmax(logits)[0].cpu().numpy()
char_list = [x.lower() for x in vocab_dict.keys()]
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription,char_list)
timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)

# dump
for word, segment in zip(transcription.split(' '), segments):
    print( word, segment )

score_min_mean_over_L question and data preparation

Hello.

Could you please help with a few questions:

  1. Have you tried different values for score_min_mean_over_L https://github.com/lumaku/ctc-segmentation/blob/master/ctc_segmentation/ctc_segmentation.py#L42?
    Could you please provide the intuition behind the value and how it is related to the frame duration?

  2. The paper mentions "To perform CTC-segmentation on the Librivox corpus, we combined the audio files with the corresponding ground truth text pieces from Project Gutenberg-DE [6]." - do you mean that you combined all audio Librivox pieces into one large audio file? Or did you use the original Librivox audio segments and cut the text into respective pieces? If so, did you cut the text manually or automatically, with some text overlap? Any observations on how the algorithm performs with phrases in the middle of the audio that don't have a corresponding transcript?

Thank you.

Package building works with Distutils, but not with Setuptools

Python packaging using Setuptools fails, but the build with Distutils works.

Setuptools:

$ python3 -m setuptools.launch setup.py sdist
Compiling ctc_segmentation/ctc_segmentation_dyn.pyx because it changed.
[1/1] Cythonizing ctc_segmentation/ctc_segmentation_dyn.pyx
/usr/lib/python3.8/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /xxx/ctc-segmentation/ctc_segmentation/ctc_segmentation_dyn.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running sdist
running egg_info
creating ctc_segmentation.egg-info
writing ctc_segmentation.egg-info/PKG-INFO
writing dependency_links to ctc_segmentation.egg-info/dependency_links.txt
writing requirements to ctc_segmentation.egg-info/requires.txt
writing top-level names to ctc_segmentation.egg-info/top_level.txt
writing manifest file 'ctc_segmentation.egg-info/SOURCES.txt'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/setuptools/launch.py", line 36, in <module>
    run()
  File "/usr/lib/python3.8/site-packages/setuptools/launch.py", line 32, in run
    exec(code, namespace)
  File "setup.py", line 39, in <module>
    setup(
  File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 163, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.8/site-packages/setuptools/command/sdist.py", line 45, in run
    self.run_command('egg_info')
  File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 297, in run
    self.find_sources()
  File "/usr/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 304, in find_sources
    mm.run()
  File "/usr/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 535, in run
    self.add_defaults()
  File "/usr/lib/python3.8/site-packages/setuptools/command/egg_info.py", line 571, in add_defaults
    sdist.add_defaults(self)
  File "/usr/lib/python3.8/distutils/command/sdist.py", line 228, in add_defaults
    self._add_defaults_ext()
  File "/usr/lib/python3.8/distutils/command/sdist.py", line 311, in _add_defaults_ext
    build_ext = self.get_finalized_command('build_ext')
  File "/usr/lib/python3.8/distutils/cmd.py", line 299, in get_finalized_command
    cmd_obj.ensure_finalized()
  File "/usr/lib/python3.8/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "setup.py", line 9, in finalize_options
    __builtins__.__NUMPY_SETUP__ = False
AttributeError: 'dict' object has no attribute '__NUMPY_SETUP__'

Distutils:

$ python setup.py sdist
running sdist
running egg_info
writing ctc_segmentation.egg-info/PKG-INFO
writing dependency_links to ctc_segmentation.egg-info/dependency_links.txt
writing requirements to ctc_segmentation.egg-info/requires.txt
writing top-level names to ctc_segmentation.egg-info/top_level.txt
reading manifest file 'ctc_segmentation.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'ctc_segmentation.egg-info/SOURCES.txt'
running check
warning: Check: missing meta-data: if 'author' supplied, 'author_email' must be supplied too

creating ctc_segmentation-1.0.6
creating ctc_segmentation-1.0.6/ctc_segmentation
creating ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying files to ctc_segmentation-1.0.6...
copying MANIFEST.in -> ctc_segmentation-1.0.6
copying README.md -> ctc_segmentation-1.0.6
copying setup.py -> ctc_segmentation-1.0.6
copying ctc_segmentation/__init__.py -> ctc_segmentation-1.0.6/ctc_segmentation
copying ctc_segmentation/ctc_segmentation.py -> ctc_segmentation-1.0.6/ctc_segmentation
copying ctc_segmentation/ctc_segmentation_dyn.c -> ctc_segmentation-1.0.6/ctc_segmentation
copying ctc_segmentation/ctc_segmentation_dyn.pyx -> ctc_segmentation-1.0.6/ctc_segmentation
copying ctc_segmentation.egg-info/PKG-INFO -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying ctc_segmentation.egg-info/SOURCES.txt -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying ctc_segmentation.egg-info/dependency_links.txt -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying ctc_segmentation.egg-info/not-zip-safe -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying ctc_segmentation.egg-info/requires.txt -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
copying ctc_segmentation.egg-info/top_level.txt -> ctc_segmentation-1.0.6/ctc_segmentation.egg-info
Writing ctc_segmentation-1.0.6/setup.cfg
creating dist
Creating tar archive
removing 'ctc_segmentation-1.0.6' (and everything under it)

Words in char list

Hi, thanks for this contribution!
I'm working on an alignment at the phonetic level for a fine-tuned Wav2vec2 model.

The char list in my case consists of two ASCII chars per entry. I have seen that in the prepare_text function, for every occurrence in the text (in my case, for example: ['ab', 'ca', ...]), each character is checked against the char list. But the char list consists of these char pairs, so it is not appended to the ground truth.

Am I missing something here? Do I have to adjust the code to my needs? Maybe an additional attribute in the config would help.

installation fails on Linux (Ubuntu 22.10)

Hi,

I have Ubuntu 22.10
I'm trying to install ctc-segmentation, but I get this error:

Collecting ctc-segmentation
  Using cached ctc_segmentation-1.7.4.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy in ./venv/lib/python3.7/site-packages (from ctc-segmentation) (1.21.6)
Requirement already satisfied: Cython in ./venv/lib/python3.7/site-packages (from ctc-segmentation) (0.29.33)
Requirement already satisfied: setuptools in ./venv/lib/python3.7/site-packages (from ctc-segmentation) (67.3.3)
Building wheels for collected packages: ctc-segmentation
  Building wheel for ctc-segmentation (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for ctc-segmentation (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [38 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-37
      creating build/lib.linux-x86_64-cpython-37/ctc_segmentation
      copying ctc_segmentation/partitioning.py -> build/lib.linux-x86_64-cpython-37/ctc_segmentation
      copying ctc_segmentation/__init__.py -> build/lib.linux-x86_64-cpython-37/ctc_segmentation
      copying ctc_segmentation/ctc_segmentation.py -> build/lib.linux-x86_64-cpython-37/ctc_segmentation
      running build_ext
      building 'ctc_segmentation.ctc_segmentation_dyn' extension
      creating build/temp.linux-x86_64-cpython-37
      creating build/temp.linux-x86_64-cpython-37/ctc_segmentation
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/tmp/pip-build-env-b3izg_6t/overlay/lib/python3.7/site-packages/numpy/core/include -I/home/dewi/Labourva/sintezenn-ar-vouezh/venv/include -I/home/dewi/.pyenv/versions/3.7.12/include/python3.7m -c ctc_segmentation/ctc_segmentation_dyn.c -o build/temp.linux-x86_64-cpython-37/ctc_segmentation/ctc_segmentation_dyn.o
      In file included from /tmp/pip-build-env-b3izg_6t/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1816,
                       from /tmp/pip-build-env-b3izg_6t/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                       from /tmp/pip-build-env-b3izg_6t/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                       from ctc_segmentation/ctc_segmentation_dyn.c:765:
      /tmp/pip-build-env-b3izg_6t/overlay/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
         15 | #warning "Using deprecated NumPy API, disable it by " \
            |  ^~~~~~~
      gcc -pthread -shared -L/home/dewi/.pyenv/versions/3.7.12/lib -L/home/dewi/.pyenv/versions/3.7.12/lib build/temp.linux-x86_64-cpython-37/ctc_segmentation/ctc_segmentation_dyn.o -o build/lib.linux-x86_64-cpython-37/ctc_segmentation/ctc_segmentation_dyn.cpython-37m-x86_64-linux-gnu.so
      /home/linuxbrew/.linuxbrew/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: skipping incompatible /lib/x86_64-linux-gnu/libc.so.6 when searching for /lib/x86_64-linux-gnu/libc.so.6
      /home/linuxbrew/.linuxbrew/bin/ld: cannot find /lib/x86_64-linux-gnu/libc.so.6
      /home/linuxbrew/.linuxbrew/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: skipping incompatible /lib/x86_64-linux-gnu/libc.so.6 when searching for /lib/x86_64-linux-gnu/libc.so.6
      /home/linuxbrew/.linuxbrew/bin/ld: /lib64/ld-linux-x86-64.so.2: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: /lib64/ld-linux-x86-64.so.2: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: skipping incompatible /lib64/ld-linux-x86-64.so.2 when searching for /lib64/ld-linux-x86-64.so.2
      /home/linuxbrew/.linuxbrew/bin/ld: cannot find /lib64/ld-linux-x86-64.so.2
      /home/linuxbrew/.linuxbrew/bin/ld: /lib64/ld-linux-x86-64.so.2: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: /lib64/ld-linux-x86-64.so.2: unknown type [0x13] section `.relr.dyn'
      /home/linuxbrew/.linuxbrew/bin/ld: skipping incompatible /lib64/ld-linux-x86-64.so.2 when searching for /lib64/ld-linux-x86-64.so.2
      collect2: error: ld returned 1 exit status
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for ctc-segmentation
Failed to build ctc-segmentation
ERROR: Could not build wheels for ctc-segmentation, which is required to install pyproject.toml-based projects

CTC Segmentation for German

Hello, I have to split the audio files in my dataset, together with their corresponding transcripts. Is there any pretrained model of yours for the German language?

Failed build in Anaconda environment

If the build fails within an Anaconda environment, the problem is often that the ld version of the Anaconda environment differs from the system version. The solution is to delete the Anaconda ld.

The error looks like this:

  Building wheel for ctc-segmentation (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /xxx/espnet/tools/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-ydww4_nh/ctc-segmentation/setup.py'"'"'; __file__='"'"'/tmp/pip-install-ydww4_nh/ctc-segmentation/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-xle6m99n
       cwd: /tmp/pip-install-ydww4_nh/ctc-segmentation/
  Complete output (28 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.7
  creating build/lib.linux-x86_64-3.7/ctc_segmentation
  copying ctc_segmentation/ctc_segmentation.py -> build/lib.linux-x86_64-3.7/ctc_segmentation
  copying ctc_segmentation/__init__.py -> build/lib.linux-x86_64-3.7/ctc_segmentation
  running build_ext
  building 'ctc_segmentation.ctc_segmentation_dyn' extension
  creating build/temp.linux-x86_64-3.7
  creating build/temp.linux-x86_64-3.7/ctc_segmentation
  gcc-7 -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/xxx/espnet/tools/venv/include/python3.7m -I/xxx/espnet/tools/venv/lib/python3.7/site-packages/numpy/core/include -c ctc_segmentation/ctc_segmentation_dyn.c -o build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o
  In file included from /xxx/espnet/tools/venv/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
                   from /xxx/espnet/tools/venv/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
                   from /xxx/espnet/tools/venv/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
                   from ctc_segmentation/ctc_segmentation_dyn.c:623:
  /xxx/espnet/tools/venv/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: Warnung: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
   #warning "Using deprecated NumPy API, disable it with " \
    ^~~~~~~
  gcc -pthread -shared -B /xxx/espnet/tools/venv/compiler_compat -L/xxx/espnet/tools/venv/lib -Wl,-rpath=/xxx/espnet/tools/venv/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o -o build/lib.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.cpython-37m-x86_64-linux-gnu.so
  /xxx/espnet/tools/venv/compiler_compat/ld: build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o: unable to initialize decompress status for section .debug_info
  /xxx/espnet/tools/venv/compiler_compat/ld: build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o: unable to initialize decompress status for section .debug_info
  /xxx/espnet/tools/venv/compiler_compat/ld: build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o: unable to initialize decompress status for section .debug_info
  /xxx/espnet/tools/venv/compiler_compat/ld: build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o: unable to initialize decompress status for section .debug_info
  build/temp.linux-x86_64-3.7/ctc_segmentation/ctc_segmentation_dyn.o: file not recognized: file format not recognized
  collect2: Fehler: ld gab 1 als Ende-Status zurück
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for ctc-segmentation

Question: Is text sentence segmentation required?

Hi,
I have really long audio files (1 hr) and a subtitle file with thousands of words.
However, there is no punctuation in the subtitle file.
I've read the NeMo implementation and it requires the text/transcript to be separated by punctuation symbols such as '.'.

  1. I was wondering, can I just use the long subtitle file? (Based on the paper, it doesn't seem to mention this issue. Please correct me if I'm wrong.)
  2. Why should the text be split into sentences if we can't use a long subtitle file?

Thanks!

How to solve "Audio is shorter than text" Error?

Hello,
I want to do a character-level alignment for a recording. I am using your "ctc_segmentation" tool, and I send each character on a new line in the input text file.
For example, for “IT GAVE” I will send a file with the following :
utt0 I
utt1 T
utt2 _
utt3 G
utt4 A
utt5 V
utt6 E
For some recordings it works pretty well. The problem is that sometimes I get the error:
"Audio is shorter than text!"
I understand that this is about the ratio between the utterances to be aligned and the audio length, but how can I solve this problem? Could tuning the 'Time stamp parameters' solve the issue?

The problem about last phoneme alignment

Hi, thanks for this great job!
I have tried to integrate it on top of my ASR module; most of the phonemes were aligned perfectly, except the last, as can be seen below.

ctc1

ctc2

The top figure shows the original waveform, and the bottom one the alignment result.
I found that the waveform was cut short towards the end; the index_duration was right, because all phonemes except the last were aligned accurately.

So how can I solve this problem? Thanks in advance.

IndexError: out of bounds

This wave file: pl.zip

This code:

import torch, transformers, ctc_segmentation
import soundfile

# wav2vec2
model_file = 'jonatasgrosman/wav2vec2-large-xlsr-53-polish'
vocab_dict = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "I": 6, "E": 7, "O": 8, "Z": 9, "N": 10, "S": 11, "W": 12, "R": 13, "C": 14, "Y": 15, "M": 16, "T": 17, "D": 18, "K": 19, "P": 20, "Ł": 21, "J": 22, "U": 23, "L": 24, "B": 25, "Ę": 26, "G": 27, "Ą": 28, "Ż": 29, "H": 30, "Ś": 31, "Ó": 32, "Ć": 33, "F": 34, "Ń": 35, "Ź": 36, "V": 37, "-": 38, "Q": 39, "X": 40, "'": 41}

processor = transformers.Wav2Vec2Processor.from_pretrained( model_file )
model = transformers.Wav2Vec2ForCTC.from_pretrained( model_file )

speech_array, sampling_rate = soundfile.read( '/tmp/pl.wav' )
assert sampling_rate == 16000
features = processor(speech_array,sampling_rate=16000, return_tensors="pt")
input_values = features.input_values
attention_mask = features.attention_mask
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription = transcription.lower().split()

# ctc-segmentation
with torch.no_grad():
    softmax = torch.nn.LogSoftmax(dim=-1)
    lpz = softmax(logits)[0].cpu().numpy()
config = ctc_segmentation.CtcSegmentationParameters()
config.index_duration = speech_array.shape[0] / lpz.shape[0] / sampling_rate
char_list = [x.lower() for x in vocab_dict.keys()]
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription,char_list)
timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)

Console:

Traceback (most recent call last):
  File "ctc.py", line 31, in <module>
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)
  File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 387, in determine_utterance_segments
    start = compute_time(utt_begin_indices[i], "begin")
  File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 380, in compute_time
    return max(timings[index + 1] - 0.5, middle)
IndexError: index 450 is out of bounds for axis 0 with size 450

Installation fails on Windows 10

Hi, the installation via pip fails on Windows 10 19043.1766 with the following error message.

I have installed the Visual Studio Windows 10 SDK and Microsoft Visual C++ 14.0.

pip install ctc-segmentation==1.7.1

Collecting ctc-segmentation==1.7.1
  Using cached ctc_segmentation-1.7.1.tar.gz (71 kB)
  Preparing metadata (setup.py) ... done

[...]

Building wheels for collected packages: ctc-segmentation
  Building wheel for ctc-segmentation (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __file__='"'"'[...]\\AppData
\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(comp
ile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d '[...]\AppData\Local\Temp\pip-wheel-zx7xruid'
       cwd: [...]\AppData\Local\Temp\pip-install-qscp16gf\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\
  Complete output (28 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.9
  creating build\lib.win-amd64-3.9\ctc_segmentation
  copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-3.9\ctc_segmentation
  copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-3.9\ctc_segmentation
  copying ctc_segmentation\__init__.py -> build\lib.win-amd64-3.9\ctc_segmentation
  running build_ext
  creating build\temp.win-amd64-3.9
  creating build\temp.win-amd64-3.9\Release
  creating build\temp.win-amd64-3.9\Release\ctc_segmentation
  "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I[...]\espnet-venv\include -I[...]\AppData\Local\Programs\Python\Python39\include -I[...]\AppData\Local\Programs\Python\Python39\Include -IC:\Users\Fab
ian\PyCharmProjects\MA_Tuyet\espnet-venv\lib\site-packages\numpy\core\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt" "-IC:\Program Files (x86)\Windows K
its\10\include\10.0.20348.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt" /Tcctc_segmentation/ctc_segmentation_dyn.c /Fobuild\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj
  ctc_segmentation_dyn.c
  c:\users\fabian\pycharmprojects\ma_tuyet\espnet-venv\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
  ctc_segmentation/ctc_segmentation_dyn.c(2338): warning C4244: "=": Konvertierung von "double" in "float", möglicher Datenverlust
  ctc_segmentation/ctc_segmentation_dyn.c(2482): warning C4244: "Funktion": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
  ctc_segmentation/ctc_segmentation_dyn.c(2499): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
  ctc_segmentation/ctc_segmentation_dyn.c(2512): warning C4244: "=": Konvertierung von "npy_intp" in "long", möglicher Datenverlust
  ctc_segmentation/ctc_segmentation_dyn.c(3240): warning C4244: "=": Konvertierung von "npy_intp" in "int", möglicher Datenverlust
  "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:[...]\espnet-venv\libs /LIBPATH:[...]\AppData\Local\Programs\Python\Python39\libs /LIBPATH:[...]\AppData\
Local\Programs\Python\Python39 /LIBPATH:[...]\espnet-venv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0
.20348.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.20348.0\um\x64" /EXPORT:PyInit_ctc_segmentation_dyn build\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj /OUT:build\lib.win-amd64-3.9\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.9\
Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib
  ctc_segmentation_dyn.obj : warning LNK4197: Export "PyInit_ctc_segmentation_dyn" wurde mehrmals angegeben; erste Angabe wird verwendet.
     Bibliothek "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib" und Objekt "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.exp" werden erstellt.
  Code wird generiert.
  Codegenerierung ist abgeschlossen.
  LINK : fatal error LNK1327: Fehler beim Ausführen von rc.exe
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit code 1327
  ----------------------------------------
  ERROR: Failed building wheel for ctc-segmentation
  Running setup.py clean for ctc-segmentation
Failed to build ctc-segmentation
Installing collected packages: ctc-segmentation
    Running setup.py install for ctc-segmentation ... error
    ERROR: Command errored out with exit status 1:
     command: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __file__='"'"'[...]\\AppDa
ta\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(co
mpile(code, __file__, '"'"'exec'"'"'))' install --record '[...]\AppData\Local\Temp\pip-record-skx2310s\install-record.txt' --single-version-externally-managed --compile --install-headers '[...]\espnet-venv\include\site\python3.9\ctc-segmentation'
         cwd: [...]\AppData\Local\Temp\pip-install-qscp16gf\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\
    Complete output (30 lines):
    running install
    [...]\espnet-venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.9
    creating build\lib.win-amd64-3.9\ctc_segmentation
    copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-3.9\ctc_segmentation
    copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-3.9\ctc_segmentation
    copying ctc_segmentation\__init__.py -> build\lib.win-amd64-3.9\ctc_segmentation
    running build_ext
    creating build\temp.win-amd64-3.9
    creating build\temp.win-amd64-3.9\Release
    creating build\temp.win-amd64-3.9\Release\ctc_segmentation
    "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -I[...]\espnet-venv\include -I[...]\AppData\Local\Programs\Python\Python39\include -I[...]\AppData\Local\Programs\Python\Python39\Include -IC:\Users\F
abian\PyCharmProjects\MA_Tuyet\espnet-venv\lib\site-packages\numpy\core\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\ucrt" "-IC:\Program Files (x86)\Windows
 Kits\10\include\10.0.20348.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.20348.0\winrt" /Tcctc_segmentation/ctc_segmentation_dyn.c /Fobuild\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj
    ctc_segmentation_dyn.c
    c:\users\fabian\pycharmprojects\ma_tuyet\espnet-venv\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
    ctc_segmentation/ctc_segmentation_dyn.c(2338): warning C4244: "=": conversion from "double" to "float", possible loss of data
    ctc_segmentation/ctc_segmentation_dyn.c(2482): warning C4244: "function": conversion from "npy_intp" to "long", possible loss of data
    ctc_segmentation/ctc_segmentation_dyn.c(2499): warning C4244: "=": conversion from "npy_intp" to "long", possible loss of data
    ctc_segmentation/ctc_segmentation_dyn.c(2512): warning C4244: "=": conversion from "npy_intp" to "long", possible loss of data
    ctc_segmentation/ctc_segmentation_dyn.c(3240): warning C4244: "=": conversion from "npy_intp" to "int", possible loss of data
    "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe" /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:[...]\espnet-venv\libs /LIBPATH:[...]\AppData\Local\Programs\Python\Python39\libs /LIBPATH:[...]\AppDat
a\Local\Programs\Python\Python39 /LIBPATH:[...]\espnet-venv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10
.0.20348.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.20348.0\um\x64" /EXPORT:PyInit_ctc_segmentation_dyn build\temp.win-amd64-3.9\Release\ctc_segmentation/ctc_segmentation_dyn.obj /OUT:build\lib.win-amd64-3.9\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.
9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib
    ctc_segmentation_dyn.obj : warning LNK4197: export "PyInit_ctc_segmentation_dyn" specified multiple times; using first specification.
       Creating library "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.lib" and object "build\temp.win-amd64-3.9\Release\ctc_segmentation\ctc_segmentation_dyn.cp39-win_amd64.exp".
    Generating code.
    Finished generating code.
    LINK : fatal error LNK1327: failure during running rc.exe
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\link.exe' failed with exit code 1327
    ----------------------------------------
ERROR: Command errored out with exit status 1: '[...]\espnet-venv\Scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"'; __fil
e__='"'"'[...]\\AppData\\Local\\Temp\\pip-install-qscp16gf\\ctc-segmentation_acdb144a29b540e48370d1fc88efaec6\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"'
, '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record '[...]\AppData\Local\Temp\pip-record-skx2310s\install-record.txt' --single-version-externally-managed --compile --install-headers '[...]\espnet-venv\include\site\python3.9\ctc-segmentation
' Check the logs for full command output.
WARNING: You are using pip version 21.3.1; however, version 22.1.2 is available.
You should consider upgrading via the '[...]\espnet-venv\Scripts\python.exe -m pip install --upgrade pip' command.

Can somebody help me? Cheers

Differences between lumaku and cornerfarmer implementations

First of all, marvelous piece of work here! Thanks @lumaku, your continued participation in ASR projects is invaluable!!

I had a query regarding the differences in implementation and features between this repo and cornerfarmer's repo.

Are there any differences with respect to the implementation? What about:

  • Performance? GPU computability: I understand the algorithm is not meant for the GPU and instead runs better on a strong single-core CPU, but mentions of using an RNN instead confuse me. Perhaps that is only for getting the logits? This part confused me a little, as did searching for a suitable pretrained RNN+CTC character-level STT implementation.
  • Working with longer audio files
  • Any other interfaces exposed or features provided

I understand that this repo is based on the cornerfarmer one, as that is the code for the paper (DOI: 10.1007/978-3-030-60276-5_27), but I would like to ask the author @lumaku if there are any insights to be gained here.

Of particular interest is my use case of force-aligning long-form audio with ctc-segmentation using ASR-generated transcripts. Any insights regarding this would be appreciated as well; otherwise I can create a new topic if that is more appropriate!

Thanks again @lumaku !

Super large audio file problems

Thanks for your work!

My audio file is more than 1 hour long, so I ran into some problems when I tried your example code in ESPnet2 and NeMo with my own data.

In ESPnet2, neither my GPU memory nor my CPU memory is large enough to run the code.

In NeMo, it prints the following:

INFO:root:CTC segmentation of 62154 chars to 8700.20s audio (217505 indices).
WARNING:root:IndexError: Backtracking was not successful, the window size might be too small.
WARNING:root:Increasing the window size to: 64000
WARNING:root:IndexError: Backtracking was not successful, the window size might be too small.
ERROR:root:Maximum window size reached.
ERROR:root:Check data and character list!
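
For context, the window that fails in this log is controlled by two fields of CtcSegmentationParameters, min_window_size (8000 frames by default, matching the doubling sequence above) and max_window_size (100000 frames by default); the run aborts once doubling the window would exceed the maximum. A minimal sketch of raising both limits when calling the package directly, assuming char_list is your model's symbol table; note that memory use grows with the window, so partitioning very long audio may still be necessary:

import ctc_segmentation

char_list = ["<blank>", "a", "b", " "]  # placeholder; use your model's symbols
config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list)
# Defaults: min_window_size=8000, max_window_size=100000 (frames).
config.min_window_size = 32000   # start with a larger window
config.max_window_size = 400000  # allow more doubling steps before giving up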

Is there any scale for confidence scores?

@lumaku Thank you so much for your incredible work. I've been working on obtaining word-level confidence scores for different ASR tools, aiming to compare Whisper and Wav2vec2 models. However, I've noticed some differences in the confidence scores between the two models. For example:

With Wav2vec2 using the CTC-Segmentation algorithm, I obtained the following word-level confidence:
["cómo":0.000, "puedo":0.646, "ayudarte":0.455]

Using Whisper with DTW (via the Whisper-Timestamped library), I obtained the following word-level confidence:
["¿Cómo":0.869, "puedo":0.998, "ayudarte?":0.999]

I understand that Wav2vec2 and Whisper have distinct architectures, with Whisper not being trained with the CTC loss, which makes a direct comparison challenging. Is there a method or approach I can use to ensure a meaningful comparison of word-level confidence scores between these two ASRs? It would be great if you could guide me on this.
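
One hedged observation: ctc-segmentation's confidence is a minimum mean log posterior, i.e. a value in (-inf, 0], while Whisper-Timestamped reports probabilities in [0, 1]; the Wav2vec2 numbers above look like exponentiated log posteriors already (e.g. exp(-0.437) ≈ 0.646). A tiny sketch with a hypothetical helper, not part of either library, that maps log-domain scores onto [0, 1] so the two can at least be plotted side by side; this is a heuristic rescaling, not a calibrated comparison:

import numpy as np

def log_conf_to_probability(log_conf: float) -> float:
    # Mean log posterior (<= 0) -> probability-like score in [0, 1].
    return float(np.exp(min(log_conf, 0.0)))

print(log_conf_to_probability(-0.437))  # ~0.646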

How does it work when I use my own CTC probabilities and char_list?

Hi, I want to test CTC segmentation on Chinese, so I use my own CTC probabilities, which I get from an acoustic model (CTC+LSTM), and my own char_list (~2000 syllables; one Chinese character corresponds to one syllable). But the result is incorrect:
[screenshot: incorrect alignment output]
The pcm file IC0773W0044-nosp.pcm is labeled "放一首邓丽君的我只在乎你", which corresponds to the syllable string "f_ang4 ii_i1 sh_ou3 d_eng4 l_i4 j_vn1 d_e0 uu_uo3 zh_i3 z_ai4 h_u1 n_i3". The audio duration is 2.46 s.
My script is align.py:

import numpy as np
import pyximport
pyximport.install(setup_args={"include_dirs":np.get_include()},build_dir="build", build_in_temp=False)
from align_fill import cython_fill_table
import sys
import torch
sys.path.append("../../../espnet")
from espnet.asr.asr_utils import get_model_conf
import os
from pathlib import Path
from time import time
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("model_path")
parser.add_argument("data_path")
parser.add_argument("eval_path")
parser.add_argument('--start_win', type=int, default=8000)
args = parser.parse_args()
max_prob = -10000000000.0

def align(lpz, char_list, ground_truth, utt_begin_indices, skip_prob):   
    #blank = 0
    blank = 2344 #my blank id
    print("Audio length: " + str(lpz.shape[0]))
    print("Text length: " + str(len(ground_truth)))
    if len(ground_truth) > lpz.shape[0] and skip_prob <= max_prob:
        raise AssertionError("Audio is shorter than text!")
    window_len = args.start_win

    # Try multiple window lengths if it fails
    while True:
        # Create table which will contain alignment probabilities
        print(lpz.shape[0],len(ground_truth),ground_truth.shape[0],ground_truth.shape[1])
        table = np.zeros([min(window_len, lpz.shape[0]), len(ground_truth)], dtype=np.float32)
        table.fill(max_prob)
        # Use array to log window offsets per character
        offsets = np.zeros([len(ground_truth)], dtype=int)  # np.int is removed in NumPy >= 1.24

        # Run actual alignment
        t, c = cython_fill_table(table, lpz.astype(np.float32), np.array(ground_truth), offsets, np.array(utt_begin_indices), blank, skip_prob)
        #for i in table:
        #    print(' '.join(map(str,i)))
        print("Max prob: " + str(table[:, c].max()) + " at " + str(t))

        # Backtracking
        timings = np.zeros([len(ground_truth)])
        char_probs = np.zeros([lpz.shape[0]])
        char_list = [''] * lpz.shape[0]  # note: this shadows the char_list argument
        current_prob_sum = 0
        try:
            # Do until start is reached
            while t != 0 or c != 0:
                # Calculate the possible transition probabilities towards the current cell
                min_s = None
                min_switch_prob_delta = np.inf
                max_lpz_prob = max_prob
                for s in range(ground_truth.shape[1]): 
                    if ground_truth[c, s] != -1:                   
                        offset = offsets[c] - (offsets[c - 1 - s] if c - s > 0 else 0)
                        switch_prob = lpz[t + offsets[c], ground_truth[c, s]] if c > 0 else max_prob
                        est_switch_prob = table[t, c] - table[t - 1 + offset, c - 1 - s]
                        if abs(switch_prob - est_switch_prob) < min_switch_prob_delta:
                            min_switch_prob_delta = abs(switch_prob - est_switch_prob)
                            min_s = s

                        max_lpz_prob = max(max_lpz_prob, switch_prob)
                
                stay_prob = max(lpz[t + offsets[c], blank], max_lpz_prob) if t > 0 else max_prob
                est_stay_prob = table[t, c] - table[t - 1, c]
                
                # Check which transition has been taken
                if abs(stay_prob - est_stay_prob) > min_switch_prob_delta:
                    # Apply reverse switch transition
                    if c > 0:
                        # Log timing and character - frame alignment
                        for s in range(0, min_s + 1):
                            timings[c - s] = (offsets[c] + t) * 10 * 4 / 1000  # 10 ms frame shift * subsampling factor 4, in seconds
                        char_probs[offsets[c] + t] = max_lpz_prob
                        char_list[offsets[c] + t] = train_args.char_list[ground_truth[c, min_s]]
                        current_prob_sum = 0

                    c -= 1 + min_s
                    t -= 1 - offset
                 
                else:
                    # Apply reverse stay transition
                    char_probs[offsets[c] + t] = stay_prob
                    char_list[offsets[c] + t] = "ε"
                    t -= 1
        except IndexError:
            # If the backtracking was not successful this usually means the window was too small
            window_len *= 2
            print("IndexError: Trying with win len: " + str(window_len))
            if window_len < 100000:
                continue
            else:
                raise
        break
    return timings, char_probs, char_list

def prepare_text(text):
    # Prepares the given text for alignment
    # Therefore we create a matrix of possible character symbols to represent the given text

    # Create list of char indices depending on the models char list
    ground_truth = "#"
    utt_begin_indices = []
    for utt in text:
        # Only one space in-between
        if ground_truth[-1] != " ":
            ground_truth += " "

        # Start of a new utterance; remember its index
        utt_begin_indices.append(len(ground_truth.strip().split()) - 1)

        # Add chars of utterance
        for char in utt.strip().split():
            if char.isspace():
                if ground_truth.strip().split()[-1] != " ":
                    ground_truth += " "
            elif char in train_args.char_list and char not in [ ".", ",", "-", "?", "!", ":", "»", "«", ";", "'", "›", "‹", "(", ")"]:
                ground_truth += char

    # Add space to the end
    if ground_truth[-1] != " ":
        ground_truth += " "
    utt_begin_indices.append(len(ground_truth.strip().split()) - 1)
    print(ground_truth)
    # Create matrix where first index is the time frame and the second index is the number of letters the character symbol spans
    max_char_len = max([len(c) for c in train_args.char_list])
    # ground_truth_mat = np.ones([len(ground_truth), max_char_len], int) * -1
    ground_truth_mat = np.ones([len(ground_truth.strip().split()), 1], int) * -1  # np.int is removed in NumPy >= 1.24
    for i in range(len(ground_truth.strip().split())):
        # for s in range(max_char_len):
        for s in range(1):
            if i-s < 0:
                continue
            span = ' '.join(ground_truth.strip().split()[i-s:i+1])
            # span = span.replace(" ", '▁')
            span = span.replace(" ", 'SP')
            print(span)
            if span in train_args.char_list:
                ground_truth_mat[i, s] = train_args.char_list.index(span)        
    print(ground_truth_mat)
    print(utt_begin_indices)
    return ground_truth_mat, utt_begin_indices

def write_output(out_path, utt_begin_indices, char_probs):
    # Uses char-wise alignments to get utterance-wise alignments and writes them into the given file
    with open(str(out_path), 'w') as outfile:
        outfile.write(str(path_wav.name) + '\n')
        def compute_time(index, type):
            # Compute start and end time of utterance.            
            middle = (timings[index] + timings[index - 1]) / 2
            if type == "begin":
                return max(timings[index + 1] - 0.5, middle)
            elif type == "end":
                return min(timings[index - 1] + 0.5, middle)

        for i in range(len(text)):
            start = compute_time(utt_begin_indices[i], "begin")
            end = compute_time(utt_begin_indices[i + 1], "end")
            start_t = int(round(start * 1000 / 40))
            end_t = int(round(end * 1000 / 40))
            # Compute confidence score by using the min mean probability after splitting into segments of 30 frames
            n = 30
            if end_t == start_t:
                min_avg = 0
            elif end_t - start_t <= n:
                min_avg = char_probs[start_t:end_t].mean()
            else:
                min_avg = 0
                for t in range(start_t, end_t - n):
                    min_avg = min(min_avg, char_probs[t:t + n].mean())                
            outfile.write(str(start) + " " + str(end) + " " + str(min_avg) + " | " + text[i] + '\n')

model_path = args.model_path
model_conf = None

# read training config
idim, odim, train_args = get_model_conf(model_path, model_conf)

#space_id = train_args.char_list.index('▁')
space_id = train_args.char_list.index('SP')
train_args.char_list[0] = "ε"
# train_args.char_list = [c.lower() for c in train_args.char_list]

data_path = Path(args.data_path)
eval_path = Path(args.eval_path)

for path_wav in data_path.glob("*.pcm"):
    chapter_sents = data_path / path_wav.name.replace(".pcm", ".txt")
    chapter_prob = eval_path / path_wav.name.replace(".pcm", ".npz")
    out_path = eval_path / path_wav.name.replace(".pcm", ".txt")
    with open(str(chapter_sents), "r") as f:
        text = [t.strip() for t in f.readlines()]
    lpz = np.load(str(chapter_prob))["arr_0"]
    print("Syncing " + str(path_wav))                    
    ground_truth_mat, utt_begin_indices = prepare_text(text)
    try:
        timings, char_probs, char_list = align(lpz, train_args.char_list, ground_truth_mat, utt_begin_indices, max_prob)
        print(timings)
    except AssertionError:
        print("Skipping: Audio is shorter than text")
        continue
    write_output(out_path, utt_begin_indices, char_probs)

I wonder where I went wrong, or whether CTC segmentation is simply not suitable for Chinese syllables. Any suggestion would be helpful. Thanks very much!
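
For reference, a minimal sketch of the same pipeline using the packaged ctc_segmentation API instead of the hand-rolled table filling above; lpz, text, and the blank id 2344 are taken from the script, char_list is assumed to be the model's ~2000-syllable table, and index_duration follows the script's 10 ms frames with subsampling factor 4:

import numpy as np
import ctc_segmentation

lpz = np.load("IC0773W0044-nosp.npz")["arr_0"]  # CTC log posteriors, shape (T, num_symbols)
text = ["f_ang4 ii_i1 sh_ou3 d_eng4 l_i4 j_vn1 d_e0 uu_uo3 zh_i3 z_ai4 h_u1 n_i3"]
char_list = [...]  # placeholder: the model's syllable symbol table

config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list)
config.blank = 2344           # non-zero blank id, as in the script
config.index_duration = 0.04  # 10 ms frame shift * subsampling factor 4

# One token-id array per utterance, one syllable per token
tokens = [np.array([char_list.index(s) for s in utt.split()], dtype=np.int32)
          for utt in text]
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_token_list(config, tokens)
timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(
    config, lpz.astype(np.float32), ground_truth_mat)
segments = ctc_segmentation.determine_utterance_segments(
    config, utt_begin_indices, char_probs, timings, text)
for utt, (start, end, conf) in zip(text, segments):
    print(f"{start:.2f} {end:.2f} {conf:.2f} | {utt}")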

Installation fails

Defaulting to user installation because normal site-packages is not writeable
Collecting ctc-segmentation
Using cached ctc_segmentation-1.7.4.tar.gz (73 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: Cython in c:\users\zhx\appdata\roaming\python\python39\site-packages (from ctc-segmentation) (0.29.32)
Requirement already satisfied: numpy in c:\users\zhx\appdata\roaming\python\python39\site-packages (from ctc-segmentation) (1.23.4)
Requirement already satisfied: setuptools in c:\program files\python39\lib\site-packages (from ctc-segmentation) (56.0.0)
Building wheels for collected packages: ctc-segmentation
Building wheel for ctc-segmentation (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for ctc-segmentation (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [12 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-39
creating build\lib.win-amd64-cpython-39\ctc_segmentation
copying ctc_segmentation\ctc_segmentation.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
copying ctc_segmentation\partitioning.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
copying ctc_segmentation\__init__.py -> build\lib.win-amd64-cpython-39\ctc_segmentation
running build_ext
building 'ctc_segmentation.ctc_segmentation_dyn' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for ctc-segmentation
Failed to build ctc-segmentation
ERROR: Could not build wheels for ctc-segmentation, which is required to install pyproject.toml-based projects

Obtaining unusual alignment results while using the ESPnet2 Branchformer model.

Firstly, I want to express my admiration for the exceptional work accomplished here!

Recently, I've been facing an issue while using the ESPnet2 Branchformer model.
Despite following the instructions here, I encountered poor alignment results.
These results occurred when I trained the model with phone-level transcriptions.

To understand this issue further, I experimented with two different token types; the details are as follows.
The accuracy of both models is above 95%.

BPE-level tokens:

[screenshot: BPE-level alignment results]

Phone-level tokens:

[screenshot: phone-level alignment results]

I would appreciate your guidance and insights to help me resolve these alignment issues.

Thank you in advance.
Tien-Hong

What do timings denote

Hi, do the timings returned from the ctc_segmentation() function denote the time when the corresponding character starts, or the time in the middle of that character (highest probability)?

Thank you.
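
Not an authoritative answer, but the hand-rolled script earlier on this page treats the timings as character transition points rather than character centers: utterance boundaries are derived as midpoints between the timings of adjacent characters, pulled at most 0.5 s toward the utterance. A condensed sketch of that boundary logic, with names of my own choosing:

def utterance_start_end(timings, begin_index, end_index, max_pad=0.5):
    # Boundary between utterances: midpoint of adjacent character timings,
    # moved at most max_pad seconds toward the utterance itself.
    start = max(timings[begin_index + 1] - max_pad,
                (timings[begin_index] + timings[begin_index - 1]) / 2)
    end = min(timings[end_index - 1] + max_pad,
              (timings[end_index] + timings[end_index - 1]) / 2)
    return start, end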
