m-bain / whisperx

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

License: BSD 4-Clause "Original" or "Old" License

Python 100.00%
asr speech speech-recognition speech-to-text whisper

whisperx's Introduction

WhisperX


[whisperx-arch: pipeline architecture diagram]

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

  • โšก๏ธ Batched inference for 70x realtime transcription using whisper large-v2
  • ๐Ÿชถ faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
  • ๐ŸŽฏ Accurate word-level timestamps using wav2vec2 alignment
  • ๐Ÿ‘ฏโ€โ™‚๏ธ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
  • ๐Ÿ—ฃ๏ธ VAD preprocessing, reduces hallucination & batching with no WER degradation

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

Phoneme-Based ASR: a suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.

Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.

New 🚨

  • 1st place at the Ego4D transcription challenge 🏆
  • WhisperX accepted at INTERSPEECH 2023
  • v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
  • v3 released, 70x speed-up open-sourced. Using batched whisper with the faster-whisper backend!
  • v2 released: code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
  • Paper drop 🎓👨‍🏫! Please see our ArXiv preprint for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 with 60-70x real-time speed.

Setup ⚙️

Tested for PyTorch 2.0, Python 3.10 (use other versions at your own risk!)

GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation.

1. Create Python3.10 environment

conda create --name whisperx python=3.10

conda activate whisperx

2. Install PyTorch, e.g. for Linux and Windows with CUDA 11.8:

conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

See other methods here.

3. Install this repo

pip install git+https://github.com/m-bain/whisperx.git

If already installed, update the package to the most recent commit:

pip install git+https://github.com/m-bain/whisperx.git --upgrade

If wishing to modify this package, clone and install in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

You may also need to install ffmpeg, rust, etc. Follow the OpenAI instructions here: https://github.com/openai/whisper#setup.

Speaker Diarization

To enable speaker diarization, include your Hugging Face access token (with read permission), which you can generate here, after the --hf_token argument, and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1. (If you choose to use Speaker-Diarization 2.x, follow the requirements here instead.) An example command is given after the note below.

Note
As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.
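
For example, a complete invocation might look like the following (YOUR_HF_TOKEN is a placeholder for your own token, and the speaker counts are illustrative):

whisperx examples/sample01.wav --model large-v2 --diarize --hf_token YOUR_HF_TOKEN --min_speakers 2 --max_speakers 2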

Usage 💬 (command line)

English

Run whisper on the example segment (using default params, whisper small); add --highlight_words True to visualise word timings in the .srt file.

whisperx examples/sample01.wav

Result using WhisperX with forced alignment to wav2vec2.0 large:

sample01.mp4

Compare this to original whisper out of the box, where many transcriptions are out of sync:

sample_whisper_og.mov

For increased timestamp accuracy, at the cost of higher GPU memory, use bigger models (a bigger alignment model was not found to be that helpful, see the paper), e.g.

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

To label the transcript with speaker IDs (set the number of speakers if known, e.g. --min_speakers 2 --max_speakers 2):

whisperx examples/sample01.wav --model large-v2 --diarize --highlight_words True

To run on CPU instead of GPU (and for running on Mac OS X):

whisperx examples/sample01.wav --compute_type int8

Other languages

The phoneme ASR alignment model is language-specific; for tested languages, these models are automatically picked from torchaudio pipelines or Hugging Face. Just pass in the --language code and use whisper --model large.

Default models are currently provided for {en, fr, de, es, it, ja, zh, nl, uk, pt}. If the detected language is not in this list, you need to find a phoneme-based ASR model on the Hugging Face model hub and test it on your data (an example is shown at the end of this section).

E.g. German

whisperx --model large-v2 --language de examples/sample_de_01.wav
sample_de_01_vis.mov

See more examples in other languages here.
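
For a language without a default alignment model, a hypothetical invocation with a community wav2vec2 checkpoint (here the Swedish model mentioned in the issues below; substitute a model you have tested and your own audio file) might look like:

whisperx path/to/audio_sv.wav --model large-v2 --language sv --align_model KBLab/wav2vec2-large-voxrex-swedish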

Python usage 🐍

import whisperx
import gc 

device = "cuda" 
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

Demos 🚀

Replicate (large-v3), Replicate (large-v2), Replicate (medium)

If you don't have access to your own GPUs, use the links above to try out WhisperX.

Technical Details 👷‍♂️

For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.

To reduce GPU memory requirements, try any of the following (options 2 and 3 can affect quality); a Python sketch follows the list:

  1. reduce the batch size, e.g. --batch_size 4
  2. use a smaller ASR model, e.g. --model base
  3. use a lighter compute type, e.g. --compute_type int8
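
As a rough Python equivalent of these options (values are illustrative; the API calls mirror the Python usage section above):

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. smaller batch size, 2. smaller ASR model, 3. lighter compute type
model = whisperx.load_model("base", device, compute_type="int8")
result = model.transcribe(audio, batch_size=4)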

Transcription differences from OpenAI's whisper:

  1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with --without_timestamps True; this ensures one forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
  2. VAD-based segment transcription, unlike the buffered transcription of OpenAI's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
  3. --condition_on_prev_text is set to False by default (reduces hallucination).

Limitations ⚠️

  • Transcript words which do not contain characters in the alignment model's dictionary, e.g. "2014." or "£13.60", cannot be aligned and are therefore not given a timing (see the sketch after this list).
  • Overlapping speech is not handled particularly well by whisper or whisperx.
  • Diarization is far from perfect.
  • A language-specific wav2vec2 model is needed.
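
A small sketch for spotting such unaligned words in the Python output; this assumes (as in recent versions) that each aligned segment exposes a "words" list whose entries may lack "start"/"end" when alignment failed:

# Hedged sketch: report words that received no timing after whisperx.align(...).
for seg in result["segments"]:
    for word in seg.get("words", []):
        if "start" not in word or "end" not in word:
            print("No timing for:", word.get("word"))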

Contribute 🧑‍🏫

If you are multilingual, a major way you can contribute to this project is to find phoneme models on Hugging Face (or train your own) and test them on speech for the target language. If the results look good, send a pull request with some examples showing its success.

Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

TODO 🗓

  • Multilingual init

  • Automatic align model selection based on language detection

  • Python usage

  • Incorporating speaker diarization

  • Model flush, for low gpu mem resources

  • Faster-whisper backend

  • Add max-line etc. see (openai's whisper utils.py)

  • Sentence-level segments (nltk toolbox)

  • Improve alignment logic

  • update examples with diarization and word highlighting

  • Subtitle .ass output <- bring this back (removed in v3)

  • Add benchmarking code (TEDLIUM for spd/WER & word segmentation)

  • Allow silero-vad as alternative VAD option

  • Improve diarization (word level). Harder than first thought...

Contact/Support 📇

Contact [email protected] for queries.

Buy Me A Coffee

Acknowledgements 🙏

This work, and my PhD, is supported by the VGG (Visual Geometry Group) and the University of Oxford.

Of course, this builds on OpenAI's whisper. It borrows important alignment code from the PyTorch tutorial on forced alignment and uses the wonderful pyannote VAD / diarization: https://github.com/pyannote/pyannote-audio

Valuable VAD & diarization models from pyannote-audio (https://github.com/pyannote/pyannote-audio).

Great backend from faster-whisper and CTranslate2

Those who have supported this work financially 🙏

Finally, thanks to the open-source contributors of this project, keeping it going and identifying bugs.

Citation

If you use this in your research, please cite the paper:
@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}

whisperx's People

Contributors

abcods, amolinasalazar, aramlang, arnavmehta7, awerks, barabazs, boulaouaney, canoalberto, carnifexer, cococig, davidas1, egorsmkv, invisprints, jim60105, jkukul, kurianbenoy, m-bain, mahmoudashraf97, marcuskbrandt, mlopsengr, remic33, smly, sohaibanwaar, sorgfresser, tengdahan, thebys, thomasmol, tijszwinkels, valentt, yasutak


whisperx's Issues

Timestamp issue

I noticed that when I transcribe videos the subtitles aren't displayed anymore. Apparently the start and end timestamps are much closer together than before. I noticed this for all my new transcriptions.

This is a comparison between a transcription I did with the same settings in December vs now
red: old transcription
green: new transcription

--- <unnamed>
+++ <unnamed>
@@ -1,55 +1,55 @@
 WEBVTT
 
-00:06.854 --> 00:09.218
+00:06.854 --> 00:06.874
 REDACTED
 
-00:10.747 --> 00:10.990
+00:10.747 --> 00:10.869
 REDACTED
 
-01:30.038 --> 01:31.039
+01:30.038 --> 01:30.098
 REDACTED
 
-01:32.560 --> 01:36.980
+01:32.560 --> 01:32.600
 REDACTED
 
-01:37.100 --> 01:37.685
+01:37.100 --> 01:37.141
 REDACTED
 
-01:39.860 --> 01:42.014
+01:39.860 --> 01:39.900
 REDACTED
 
-01:42.960 --> 01:43.124
+01:42.960 --> 01:43.062
 REDACTED
 
-01:44.538 --> 01:44.620
+01:44.538 --> 01:44.559
 REDACTED
 
-01:45.820 --> 01:47.458
+01:45.820 --> 01:45.861
 REDACTED
 
-01:47.680 --> 01:49.620
+01:47.680 --> 01:47.741
 REDACTED
 
-01:49.660 --> 01:51.518
+01:49.660 --> 01:49.761
 REDACTED
 
-01:54.140 --> 01:57.878
+01:54.140 --> 01:54.301
 REDACTED
 
-01:58.400 --> 02:03.240
+01:58.400 --> 01:58.541
 REDACTED
 
-02:18.761 --> 02:20.514
+02:18.761 --> 02:18.781
 REDACTED
 
-02:21.280 --> 02:23.316
+02:21.280 --> 02:21.381
 REDACTED
 
-02:25.300 --> 02:27.217
+02:25.300 --> 02:25.361
 REDACTED
 
-02:27.600 --> 02:32.379
+02:27.600 --> 02:27.620
 REDACTED
 
-02:32.840 --> 02:34.756
+02:32.840 --> 02:32.921
 REDACTED

pyannote.audio use guarded in code but not setup

pyannote.audio brings some great features, but it is rightfully guarded in code since it requires additional credentials. Additionally, it appears to be currently incompatible with the newest Macs, meaning the install fails when it is included. The recommendation here would be to make the additional features it brings an extra instead of part of the default install. Maybe just add:

#added to the setup() call in setup.py
#additionally, it should be removed from the requirements.txt
#I've never been good at naming things.
extras_require={"pyannote": ["pyannote.audio"]},

Alternatively, the code can remove it from the setup entirely and detect, and gracefully handle, the case where it isn't installed when VAD and diarization are called. This may be preferred, since decoupling the diarization/VAD implementation could be useful in the future when there are multiple valid options.
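
A sketch of that graceful-degradation idea (illustrative only, not the repository's actual code; the pipeline name is the standard pyannote one):

# Hypothetical optional-dependency guard for pyannote.audio.
try:
    from pyannote.audio import Pipeline
    PYANNOTE_AVAILABLE = True
except ImportError:
    PYANNOTE_AVAILABLE = False

def get_diarization_pipeline(hf_token):
    if not PYANNOTE_AVAILABLE:
        raise RuntimeError(
            "pyannote.audio is not installed; install the 'pyannote' extra to enable VAD/diarization."
        )
    return Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=hf_token)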

Explore the usage of multilingual models

I think we need to explore multilingual models such as wav2vec2-xls-r-300m-21-to-en to see if the 300M models are better than the 53M models currently used for low-resource languages, and to see if we could use a single model for multilingual alignment.

I have the code written but not thoroughly tested, so I might share it after the merging of #53, but I wanted to hear your thoughts about this first.

The result is different when executed multiple times

1. Command:
whisperx ./audio/diff-result-times.mp4 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --output_dir ./audio/ --language en --model small.en

2. Audio:

diff-result-times.mp4

3. Different results, for example:
[screenshot]

Phoneme-level timestamps

Great work!
Is it possible to expand this to phoneme-level timestamps, instead of word-level timestamps?

For example, instead of
"[00:13.50->00:13.60] smiles"
have
"[00:13.50->00:13.53] s
[00:13.53->00:13.57] mi
[00:13.57->00:13.58] l
[00:13.58->00:13.60] es"

how to output joined segments for a longer on-screen session?

The generated .ass is giving constant full segments with occasional word highlighting. How can I make it so the segments are joined for a longer on-screen session?

ty @m-bain, looking forward to a sweet built-in diarization implementation! This will also be very useful for quick preview of audio sound clips when searching.

Explanation of use-case:


Would like to see if the above question is easily achieved, because my eyes follow the auto-generated captions on youtube videos, the scrolling, like RSVP (https://en.wikipedia.org/wiki/Rapid_serial_visual_presentation), and I only watch videos for that reason. It is sort of a speed-reading-like tool, but instead of presenting one word at a time, which can be nauseating, a more relaxed method is possible, where you can look at old text.

These .ass files generated are possible to view with audio files on android 12 mpv, when renaming them to .srt.
A display that can hold more text than 2 lines (unlike youtube) would be ideal.

Edit: HFValidationError + more

I can use --diarize alone and it works, but if I add --vad_filter I'll get this error.

huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'pyannote\segmentation'.

Which doesn't make sense, since the hf_token is validated only when diarizing. Aaaaaaand it's here, while writing this, that I just realized that the segmentation model is used to improve the diarization and is another repository.

Also, a simple cross-talking answer like yes or no is often under 300 ms long. Pyannote only accurately detects a speaker change every 2 seconds of spoken audio, which is almost 7 times slower than what is needed if we want to use as little "logic" as possible, as I call it.

"Logic": Whisper almost always detects when a new sentence is needed, also if it's an answer to a question. So, using the timestamps of the spoken word(s), I rerun the diarization, but with empty space at the beginning, or, as I do, I double the length of the audio file and then divide the time. This will 90% of the time be treated as another speaker than the one who spoke before. But it still can't accurately detect who is talking and would just create a new speaker, e.g. Speaker_03, if only 2 people were talking. This means there are a lot of annoying but simple steps in listening to each unknown speaker and labelling them correctly. I would say that with my method the accuracy of guessing the speaker in the stretched audio is around 50%, so not the worst.

I also briefly checked out a few stats from Nvidia NeMo diarization/segmentation, and it seems to be a tiny bit better at handling switching speakers. But accuracy drops drastically when it's nearing the 0.5-1 second mark (40% I think I saw). Then again, maybe the same method of stretching the audio would help a bit, like it does with Pyannote. The bad thing with Nvidia NeMo is that it can only be trained with GPUs that support CUDA, or maybe can only run well on CUDA GPUs; I'm not sure and haven't tried, as I only have an RX 580 8GB and a Ryzen 2700x.

Also, a maximum segment length in characters would be nice, as when you get past 70 characters it begins to fill a lot of the subtitle floor. Like this port that uses --max-len "characters" to limit it very well.

Now I would not call myself a programmer, I just play around with things and cross my fingers that I made it work. I will look into it myself and maybe make a pull request if I can improve or implement something. 😃

Running on short audio: KeyError: 'level_1'

Failed to align segment: no characters in this segment found in model dictionary, resorting to original...
I was testing aligning an audio file, but it didn't work and gave the above error. It was a plain English .wav file.

Error after updating to pandas commit.

I keep getting this error after the new commit to Pandas:

File "//whisper.py", line 47, in whisper
result_aligned = whisperx.align(segments, model_a, metadata, audio_file, device)
File "/usr/local/lib/python3.9/site-packages/whisperx/alignment.py", line 309, in align
word_segments_arr["segment-text-start"] = per_word_grp["level_1"].min().reset_index()["level_1"]
File "/usr/local/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1416, in getitem
return super().getitem(key)
File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 248, in getitem
raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: level_1'

instructions on updating

You provide installation instructions. It would be nice to mention how to update in the same area of the readme so that people who know enough to follow the install line can also be given the information to update. (even if that is just to perform the same command again)

Working with the new VAD Feature

I'm currently trying to work with the new VAD feature but I'm getting the following error:

TypeError: transcribe_with_vad() missing 1 required positional argument: 'vad_pipeline'

Is there sample code anywhere for transcribing with vad?

Korean Model Issues with Alignment and Transcription

Terminal (Main) align_extend 2
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.360]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์ž…๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:03.360 --> 00:09.120]  ์•„๋ฌด๋„ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ์‚ฌ๋žŒ์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:09.120 --> 00:18.240]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด๋กœ ๋งํ•˜๋Š” ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•˜๋Š” ๊ฒฝ์šฐ์— ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:18.240 --> 00:23.160]  ์˜์–ด๋กœ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๊ฒƒ๊ณผ๋Š” ์ฐจ์ด๊ฐ€ ์—†์ฃ .
[00:23.160 --> 00:24.160]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:30.920 --> 01:34.580]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:34.580 --> 01:38.320]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:38.320 --> 01:41.000]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:41.000 --> 01:44.560]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:44.560 --> 01:45.600]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:49.600 --> 01:54.720]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ, ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:54.720 --> 01:59.680]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:24.720 --> 02:32.720]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:32.720 --> 02:37.320]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:43.920]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:47.020 --> 02:49.020]  TALK TO ME IN KOREAN ์—์„œ ๋งŒ๋‚˜์š”!
[02:49.020 --> 02:49.520]  Bye!
Performing alignment...
[00:00.000 --> 00:00.502]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์ž…๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:01.360 --> 00:02.021]  ์•„๋ฌด๋„ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ์‚ฌ๋žŒ์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:07.120 --> 00:08.282]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด๋กœ ๋งํ•˜๋Š” ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•˜๋Š” ๊ฒฝ์šฐ์— ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:16.240 --> 00:16.821]  ์˜์–ด๋กœ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๊ฒƒ๊ณผ๋Š” ์ฐจ์ด๊ฐ€ ์—†์ฃ .
[00:21.160 --> 00:21.260]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:28.920 --> 01:29.522]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:32.580 --> 01:33.322]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:36.320 --> 01:36.781]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:39.000 --> 01:39.501]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:42.560 --> 01:42.640]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:47.600 --> 01:48.161]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ, ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:52.720 --> 01:53.321]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:22.720 --> 02:23.581]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:30.720 --> 02:30.840]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:45.020 --> 02:45.160]  TALK TO ME IN KOREAN ์—์„œ ๋งŒ๋‚˜์š”!
[02:49.020 --> 02:49.520]  Bye!

E:\Applications\WhisperX>
Terminal (Not described below) No align_extend parameter
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.320]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์—์š”. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:03.320 --> 00:09.060]  ์•„๋ฌด๋„ ๋ˆ„๊ตฐ๊ฐ€์—๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:09.060 --> 00:18.280]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด ๊ฐ™์€ ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•  ๋•Œ ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:18.280 --> 00:23.100]  ์˜์–ด์—์„œ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋ง์˜ ์ฐจ์ด๊ฐ€ ์ ์Šต๋‹ˆ๋‹ค.
[00:23.100 --> 00:24.200]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:52.480 --> 00:58.720]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:03.220 --> 01:08.440]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:30.840 --> 01:34.580]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:34.580 --> 01:38.280]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:38.280 --> 01:40.920]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:40.920 --> 01:44.520]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:44.520 --> 01:45.880]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:49.880 --> 01:54.920]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:54.920 --> 01:59.720]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:24.920 --> 02:32.920]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:32.920 --> 02:37.360]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:44.000]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:47.000 --> 02:49.000]  TalkToMeInKorean์—์„œ ๋งŒ๋‚˜์š”!
[02:49.000 --> 02:49.500]  Bye!
Performing alignment...
[00:00.000 --> 00:00.482]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์—์š”. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:01.340 --> 00:02.062]  ์•„๋ฌด๋„ ๋ˆ„๊ตฐ๊ฐ€์—๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:07.060 --> 00:08.102]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด ๊ฐ™์€ ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•  ๋•Œ ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:16.280 --> 00:16.801]  ์˜์–ด์—์„œ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋ง์˜ ์ฐจ์ด๊ฐ€ ์ ์Šต๋‹ˆ๋‹ค.
[00:21.100 --> 00:21.220]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:01.220 --> 01:01.821]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:28.840 --> 01:29.442]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:32.580 --> 01:33.322]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:36.280 --> 01:36.761]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:38.920 --> 01:39.441]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:42.520 --> 01:42.600]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:47.880 --> 01:48.481]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:52.920 --> 01:53.501]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:22.920 --> 02:23.721]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:30.920 --> 02:31.040]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:45.000 --> 02:45.120]  TalkToMeInKorean์—์„œ ๋งŒ๋‚˜์š”!
[02:49.000 --> 02:49.500]  Bye!

E:\Applications\WhisperX>

Environment

OS: Windows 10
Python: 3.9.9
WhisperX: e909f2f
Whisper Model: Large
Alignment Model: w11wo/wav2vec2-xls-r-300m-korean

Input

https://www.youtube.com/watch?v=GM2Ki_FmF5U (720p version, audio + video, mp4, I let WhisperX preprocess it)

WhisperX Command

whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2

Issues

Note: Everything below is described using align_extend 2 as shown in the command above and the Terminal (Main) details above

Translating English to Korean when I don't want it to

In the input, the speaker uses a mix of English and Korean. In the video's introduction, English is spoken and later in the video, such as during the example sentences, he switches to Korean. Instead of returning the introduction in English, you can see that it instead translated the English sentences into Korean for some reason.

This behavior is inconsistent too. For example, at 0:00:24, English is spoken and WhisperX transcribes it in English. That is fine. However, as mentioned above, during the introduction it transcribed it from English into Korean. I have no clue why that is.

Alignment Issues

EDIT: SOLVED. I remembered that in #7 there was mention of having to use Chinese instead of cn, so I took a look at the issue again and saw it was regarding alignment. I changed kr to Korean for the language parameter and this issue was resolved.

At first I tried passing in no align_extend parameter. That made the transcribed captions even worse. I then used align_extend 2 as given in the Japanese example in the README which improved my results. It made everything better, and the English portions are lined up. However, the issue is the first occurrence (0:00:52) of the following line below is not aligned properly at all:

๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

The transcription is correct. The issue is the alignment. As you can see from the output of the command line, initially everything is correct before it performs the alignment

[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

However, after the alignment, it seems like all of the Korean sentences' alignments are less than a second long. Here is just one example, but if you look at the output above, you will notice it's the case for all Korean sentences post-alignment.

[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

Support inference from fine-tuned ๐Ÿค— transformers Whisper models

Hi @m-bain,

This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions.
I was wondering if you'd like to extend the current transcription codebase to also support transformers fine-tuned Whisper checkpoints as well.

For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers, and over the course of the event we managed to fine-tune 650+ Whisper checkpoints across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard

In almost all cases the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.

I think it'll be of huge benefit for the community to be able to utilise these models with your repo. Happy to support you if you have any questions from the 🤗 transformers side. :)

Cheers,
VB

missing pip install soundfile

Thank you for your work, but unless one also does 'pip install soundfile' during the installation, they will get an error during initialization of WhisperX.

I would suggest adding that step to the readme as even though it doesn't stop operation it does throw a warning.

Add option to only align text to audio

It would be nice if there was a flag to pass an .srt or other text file directly to the alignment, skipping the speech recognition.
E.g. sometimes I already have a ground-truth transcription and I only want to align it to the audio.
Is this possible?
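
There is no such flag at the moment, but a rough sketch via the Python API (assuming whisperx.align only needs "start", "end" and "text" on each segment, which may vary by version) would be:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Hypothetical ground-truth transcript, e.g. parsed from an existing .srt file.
segments = [
    {"start": 0.0, "end": 4.2, "text": "Hello and welcome to the show."},
    {"start": 4.2, "end": 7.9, "text": "Today we talk about forced alignment."},
]

model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(segments, model_a, metadata, audio, device)
print(aligned["segments"])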

Use of fine-tuned Whisper

What a great tool! Is it somehow possible to also use a version of Whisper that has been fine-tuned? I have one model trained with transformers on Hugging Face.
Thanks.

"word-level" doesn't always exist

I tried to translate from Japanese to English and use whisperx.
In exactly one entry, "word-level" is missing from its alignment dict, causing utils.write_ass to fail.

Alignment output entry:

    {
        "id": 654,
        "seek": 518644,
        "start": 5198.74,
        "end": 5198.7404,
        "text": " Shizu",
        "tokens": [
            1160,
            590,
            84
        ],
        "temperature": 1,
        "avg_logprob": -4.464975124452172,
        "compression_ratio": 1.0123456790123457,
        "no_speech_prob": 0.07407991588115692
    },

Can't process MP3 files with VAD or diarization

Hi @m-bain. Thank you so much for your amazing work!

I wanted to test your new VAD feature, but I get an error stating that:

Traceback (most recent call last):
  File "/usr/local/bin/whisperx", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 451, in cli
    result = transcribe_with_vad(model, audio_path, vad_pipeline, temperature=temperature, **args)
  File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 310, in transcribe_with_vad
    vad_segments = vad_pipeline(audio)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/pipeline.py", line 238, in __call__
    return self.apply(file, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/pipelines/voice_activity_detection.py", line 197, in apply
    segmentations: SlidingWindowFeature = self._segmentation(file)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/inference.py", line 328, in __call__
    waveform, sample_rate = self.model.audio(file)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/io.py", line 278, in __call__
    waveform, sample_rate = torchaudio.load(file["audio"])
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/backend/soundfile_backend.py", line 205, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'xxxx.mp3': File contains data in an unknown format.

I've tried with different (valid) mp3 files and each time it results in this error.

I'm using whisperx on Google Colab.

// EDIT: I get the same error when I try the diarization feature.

Error when trying to use the pt align model.

I saw that support for portuguese was added a few commits ago and decided to give it a go. But when loading the align model this error happens:

ValueError: The chosen align_model "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese" could not be found in huggingface (https://huggingface.co/models) or torchaudio (https://pytorch.org/audio/stable/pipelines.html#id14)

How to pass in traditional whisper CLI parameters?

I cannot find this anywhere in the documentation. In the whisperx transcribe function there is a massive section of optional parameters that can be passed in. How can I actually use these in python?

    # parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
    # parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
    # parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
    # # alignment params
    # parser.add_argument("--align_model", default=None, help="Name of phoneme-level ASR model to do alignment")
    # parser.add_argument("--align_extend", default=2, type=float, help="Seconds before and after to extend the whisper segments for alignment")
    # parser.add_argument("--align_from_prev", default=True, type=bool, help="Whether to clip the alignment start time of current segment to the end time of the last aligned word of the previous segment")
    # parser.add_argument("--drop_non_aligned", action="store_true", help="For word .srt, whether to drop non aliged words, or merge them into neighbouring.")

    # parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
    # parser.add_argument("--output_type", default="srt", choices=['all', 'srt', 'vtt', 'txt'], help="File type for desired output save")

    # parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")

    # parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate')")
    # parser.add_argument("--language", type=str, default=None, choices=sorted(LANGUAGES.keys()) + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]), help="language spoken in the audio, specify None to perform language detection")

    # parser.add_argument("--temperature", type=float, default=0, help="temperature to use for sampling")
    # parser.add_argument("--best_of", type=optional_int, default=5, help="number of candidates when sampling with non-zero temperature")
    # parser.add_argument("--beam_size", type=optional_int, default=5, help="number of beams in beam search, only applicable when temperature is zero")
    # parser.add_argument("--patience", type=float, default=None, help="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search")
    # parser.add_argument("--length_penalty", type=float, default=None, help="optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default")

    # parser.add_argument("--suppress_tokens", type=str, default="-1", help="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations")
    # parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")
    # parser.add_argument("--condition_on_previous_text", type=str2bool, default=False, help="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop")
    # parser.add_argument("--fp16", type=str2bool, default=True, help="whether to perform inference in fp16; True by default")

    # parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
    # parser.add_argument("--compression_ratio_threshold", type=optional_float, default=2.4, help="if the gzip compression ratio is higher than this value, treat the decoding as failed")
    # parser.add_argument("--logprob_threshold", type=optional_float, default=-1.0, help="if the average log probability is lower than this value, treat the decoding as failed")
    # parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")
    # parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")

Where would I actually put this? Transcribe does not seem to have an input parameter for these, neither does load_model.

    model = whisperx.load_model(modelSize, device)
    result = model.transcribe(audio_file)
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

Specifically interested in --threads, --beam_size, --patience, and --best_of

Error when adding torch, and also how to add --diarize/parameters when using python

I have cloned WhisperX and I'm getting an error about the torch 1.8 requirement. It seems like there is a syntax error at "torch (>=1.8.*)" when run; it should probably be replaced with "torch >=1.8.0", or "torch >=1.8,<1.9" if torch 1.9 also works.

epkg_resources.extern.packaging.requirements.InvalidRequirement: Expected closing RIGHT_PARENTHESIS - torch (>=1.8.*)

I'm also interested in learning how to add parameters in python.

Index out-of-range access to word_segment_list

@m-bain Thank you for whisperX!

In some audio, I found that an IndexError: list index out of range error happens at the aligning stage.

Traceback (most recent call last):                                                                            
  File "/home/syoyo/miniconda3/envs/whisperx/bin/whisperx", line 8, in <module>                               
    sys.exit(cli())                                                                                           
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 505, in
 cli                                                                                                          
    result_aligned = align(result["segments"], align_model, align_metadata, audio_path, device,              
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 374, in
 align                                                                                                        
    word_segments_list[-1]['text'] += ' ' + curr_word                                                        
IndexError: list index out of range 

How to reproduce

https://commonvoice.mozilla.org/en/datasets

Download Japanese -> Common Voice Corpus 12.0

[screenshot]

$ whisperx --model large --language ja cv-corpus-12.0-2022-12-07/ja/clips/common_voice_ja_35797612.mp3

Investigation

word_segments_list[-1]['text'] += ' ' + curr_word

Here is the dump of t_local and t_words before for x in range(len(t_local)):

t_local = [None, None, (0.842688, 1.083456), (1.424544, 1.5048), (1.524864, 1.544928), (1.665312, 1.7455679999999998), (1.7656319999999999, 1.785696), (1.865952, 2.026464), (2.046528, 2.186976), (2.20704, 2.2271039999999998), (2.247168, 2.327424), (2.347488, 2.367552), (2.427744, 2.508), (2.548128, 2.568192), (2.688576, 2.969472), None, (3.089856, 3.10992), (3.129984, 3.2102399999999998), (3.31056, 3.8322239999999996), (3.852288, 3.9325439999999996), (4.052928, 4.072992), (4.133184, 4.2535680000000005), (4.333824, 4.434144), (4.494336, 4.715039999999999), (4.735104, 4.755168), (4.775232, 5.196576), (5.31696, 5.457407999999999), (5.597856, 5.678112), (5.738303999999999, 5.81856), (5.838624, 5.898815999999999), (5.9790719999999995, 6.0793919999999995), (6.159648, 6.199776), (6.21984, 6.280032), (6.360288, 6.460608000000001), (6.480671999999999, 6.5207999999999995), (6.641183999999999, 6.761568), (6.781632, 6.801696), None]
t_words = ['็ซฅ', '่ฒž', 'ๅŠฉ', 'ใ‹', 'ใ‚‰', 'ใช', 'ใ„', 'ใจ', 'ๆ€', 'ใฃ', 'ใฆ', 'ใ„', 'ใ‚‹', 'ใ‹', 'ใ‚‰', 'ใ€', 'ใ„', 'ใ‚‹', 'ใจ', 'ใ†', 'ใฃ', 'ใจ', 'ใ•', 'ใ‚Š', 'ใฃ', 'ใจ', '้Ÿณ', 'ใŒ', 'ใ—', 'ใฆ', '็›ฎ', 'ใ‹', 'ใ‚‰', '็ซ', 'ใŒ', 'ๅ‡บ', 'ใŸ', 'ใ€‚']

Probably we need to consider a situation where t_local may contain multiple Nones from the start? (A sketch of one possible guard follows.)
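
One possible guard, sketched against the snippet above (not the maintainer's actual fix; fields other than 'text' are omitted for brevity):

# Only merge into the previous word segment if one exists;
# otherwise start a new entry (handles leading Nones in t_local).
if word_segments_list:
    word_segments_list[-1]['text'] += ' ' + curr_word
else:
    word_segments_list.append({'text': curr_word})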

word level confidences

Hi,

I was trying out your package, it seems like a pretty useful addition to whisper.

I was wondering if you had any plans to add word level confidences / logprobs, eg similar to this ticket?
openai/whisper#284

Thanks!

Needed steps to add a new language

Hi, do I just find a compatible Wav2Vec2 model for the language and add it to the models list or any other steps needed? also is there are any modifications needed for RTL languages such as Arabic?

Failed to align

Thank you for this repo first!
When I tried to align an audio file, I met this error.

File "/home/ubuntu/.conda/envs/whisper/lib/python3.9/site-packages/whisperx/alignment.py", line 67, in backtrack
    raise ValueError("Failed to align")
ValueError: Failed to align

Any idea how to fix this?

OOM && Large GPU usage

Hi, I am using this, but it doesn't seem to release the GPU VRAM. I am using the large-v2 model and it's using 16 GB of GPU RAM, which I don't think is normal at all; moreover, it doesn't free up the RAM afterwards. Is this normal?
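
The README's Python example hints at the intended workaround between stages; a sketch of explicitly releasing the model:

import gc
import torch

# After transcription (or alignment), drop the reference and free cached GPU memory.
del model
gc.collect()
torch.cuda.empty_cache()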

Does it support Chinese?

Hi,

I couldn't find align model used for Chinese in "torchaudio.pipelines". Does it support Chinese?

Thanks

Length of the written text

If someone speaks for too long, the whole text will appear across the entire screen at that timeframe, so is there any option to cut this into pieces?

Got IndexError: list index out of range

At the 'Performing alignment...' step I got an IndexError: list index out of range error
whisperx --model large --task translate --language ja movie.mp4 --align_model jonatasgrosman/wav2vec2-large-xlsr-53-japanese

[screenshot: cmd_mi3f2FHNZm]

amazing project

hello team, thank you for this project!

Just wondering, is there a possibility to incorporate whisper mic for real-time processing?

Low Accuracy on German

I wanted to use WhisperX to do forced alignment on the Mozilla Common Voice German dataset, but the words are often cut off or the segments do not align at all.

Additionally, some audio tracks are recognized as Farsi instead of German.

Is it because of the short duration of these clips (< 2-5 seconds, each)?
And how can I improve this accuracy?

Is the accuracy of the english models (for english audio) better?

Align Models and Fine-Tuning

I'm trying to use whisperX on the Korean language and came across some issues on how to do so. Since there's no default model, I went over to HF to find a model to use. As expected, there are many models and I'm not too sure which one to choose, especially because some of the recent models trained by slplab are lacking evaluation results. Also, not really sure what they are evaluated on or any other models. Personally I am not familiar with wav2vec2 or the evaluation benchmark it uses let alone what these other trained models are benchmarked against.

Let me get back on topic: does it matter if the model I select was fine-tuned on wav2vec2-large-xlsr-53? I'm asking because the current default Japanese model was fine-tuned on it and is now the default. For Chinese, another model fine-tuned on wav2vec2-large-xlsr-53 was again selected according to #7.

Do I have to choose a model like slplab/wav2vec2-large-xlsr-53-korean-samsung-54k or could I use thisisHJLee/wav2vec2-large-xls-r-1b-korean-sample5?

Still some incoherent timestamps in the srt file

@m-bain Thank you so much for your amazing work.
There are still some incoherent timestamps in the word-level srt files (it was the case for 139 files out of 360 in my data). I'm about to write a Python script to parse all the srt files and fix the affected timestamps, but maybe there is a way to avoid them from the beginning? It makes it hard to convert them into TextGrid files... (I use https://github.com/rctatman/SrtToTextgrid ). Besides that, WhisperX is working so well!!

Diarized transcription in .txt file?

The diarize version just puts the transcript in an .ass file atm as far as I can see - is it possible in the current version to get the diarized output in a .txt file? (A sketch follows below.)

Been trying this out with KBLab/wav2vec2-large-voxrex-swedish for Swedish, and while I lack the hardware for extensive testing atm, it seems to be working fine.
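
A minimal sketch for dumping a speaker-labelled transcript to a .txt file, assuming (as in the Python example above) that whisperx.assign_word_speakers leaves a "speaker" key on each segment:

# Write "SPEAKER: text" lines from the diarized result (sketch only).
with open("transcript.txt", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        f.write(f"{speaker}: {seg['text'].strip()}\n")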

Suggested enhancement: increase length of resulting segments

I assume that the resulting segments resemble the black lines in the image. In the image, the segments have been lengthened at the beginning (red line), and at the end (green line).

Is it possible to add parameters for the optional lengthening of all the resulting segments at the beginning and at the end?

[parameter 1:] The red line is an example of the result of a parameter to lengthen all segments at the beginning (in milliseconds).

[parameter 2:] The green line is an example of the result of a parameter to lengthen all segments at the end (in milliseconds).

[parameter 3:] A third parameter is for the minimum distance (in milliseconds) between the end of a (lengthened) segment and the beginning of the next (lengthened) segment. This third parameter may overrule the other two parameters to prevent a collision.
