m-bain / whisperx

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

License: BSD 4-Clause "Original" or "Old" License

Python 100.00%
asr speech speech-recognition speech-to-text whisper

whisperx's Introduction

WhisperX


[whisperx-arch: pipeline architecture diagram]

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

  • โšก๏ธ Batched inference for 70x realtime transcription using whisper large-v2
  • ๐Ÿชถ faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
  • ๐ŸŽฏ Accurate word-level timestamps using wav2vec2 alignment
  • ๐Ÿ‘ฏโ€โ™‚๏ธ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
  • ๐Ÿ—ฃ๏ธ VAD preprocessing, reduces hallucination & batching with no WER degradation

Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

Phoneme-Based ASR: a suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.

Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.

Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.

New 🚨

  • 1st place at the Ego4D transcription challenge 🏆
  • WhisperX accepted at INTERSPEECH 2023
  • v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
  • v3 released, 70x speed-up open-sourced. Using batched whisper with the faster-whisper backend!
  • v2 released: code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
  • Paper drop 🎓👨‍🏫! Please see our ArXiv preprint for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 with 60-70x real-time speed.

Setup ⚙️

Tested for PyTorch 2.0, Python 3.10 (use other versions at your own risk!)

GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation.

1. Create Python3.10 environment

conda create --name whisperx python=3.10

conda activate whisperx

2. Install PyTorch, e.g. for Linux and Windows with CUDA 11.8:

conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

See other methods here.

3. Install this repo

pip install git+https://github.com/m-bain/whisperx.git

If already installed, update the package to the most recent commit:

pip install git+https://github.com/m-bain/whisperx.git --upgrade

If wishing to modify this package, clone and install in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

You may also need to install ffmpeg, rust, etc. Follow the OpenAI instructions here: https://github.com/openai/whisper#setup.

Speaker Diarization

To enable speaker diarization, include your Hugging Face access token (with read permission), which you can generate here, after the --hf_token argument, and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1. (If you choose to use Speaker-Diarization 2.x, follow the requirements here instead.) An example command is given after the note below.

Note
As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see this issue for more details and potential workarounds.
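
For example, a complete invocation might look like the following (YOUR_HF_TOKEN is a placeholder for your own token, and the speaker counts are illustrative):

whisperx examples/sample01.wav --model large-v2 --diarize --hf_token YOUR_HF_TOKEN --min_speakers 2 --max_speakers 2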

Usage 💬 (command line)

English

Run whisper on the example segment (using default params, whisper small); add --highlight_words True to visualise word timings in the .srt file.

whisperx examples/sample01.wav

Result using WhisperX with forced alignment to wav2vec2.0 large:

sample01.mp4

Compare this to original whisper out of the box, where many transcriptions are out of sync:

sample_whisper_og.mov

For increased timestamp accuracy, at the cost of higher GPU memory, use bigger models (a bigger alignment model was not found to be that helpful, see the paper), e.g.

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

To label the transcript with speaker IDs (set the number of speakers if known, e.g. --min_speakers 2 --max_speakers 2):

whisperx examples/sample01.wav --model large-v2 --diarize --highlight_words True

To run on CPU instead of GPU (and for running on Mac OS X):

whisperx examples/sample01.wav --compute_type int8

Other languages

The phoneme ASR alignment model is language-specific; for tested languages, these models are automatically picked from torchaudio pipelines or Hugging Face. Just pass in the --language code and use whisper --model large.

Default models are currently provided for {en, fr, de, es, it, ja, zh, nl, uk, pt}. If the detected language is not in this list, you need to find a phoneme-based ASR model on the Hugging Face model hub and test it on your data (an example is shown at the end of this section).

E.g. German

whisperx --model large-v2 --language de examples/sample_de_01.wav
sample_de_01_vis.mov

See more examples in other languages here.
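
For a language without a default alignment model, a hypothetical invocation with a community wav2vec2 checkpoint (here the Swedish model mentioned in the issues below; substitute a model you have tested and your own audio file) might look like:

whisperx path/to/audio_sv.wav --model large-v2 --language sv --align_model KBLab/wav2vec2-large-voxrex-swedish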

Python usage 🐍

import whisperx
import gc 

device = "cuda" 
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

Demos 🚀

Replicate (large-v3), Replicate (large-v2), Replicate (medium)

If you don't have access to your own GPUs, use the links above to try out WhisperX.

Technical Details 👷‍♂️

For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.

To reduce GPU memory requirements, try any of the following (options 2 and 3 can affect quality); a Python sketch follows the list:

  1. reduce the batch size, e.g. --batch_size 4
  2. use a smaller ASR model, e.g. --model base
  3. use a lighter compute type, e.g. --compute_type int8
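
As a rough Python equivalent of these options (values are illustrative; the API calls mirror the Python usage section above):

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. smaller batch size, 2. smaller ASR model, 3. lighter compute type
model = whisperx.load_model("base", device, compute_type="int8")
result = model.transcribe(audio, batch_size=4)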

Transcription differences from OpenAI's whisper:

  1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with --without_timestamps True; this ensures one forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
  2. VAD-based segment transcription, unlike the buffered transcription of OpenAI's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
  3. --condition_on_prev_text is set to False by default (reduces hallucination).

Limitations ⚠️

  • Transcript words which do not contain characters in the alignment model's dictionary, e.g. "2014." or "£13.60", cannot be aligned and are therefore not given a timing (see the sketch after this list).
  • Overlapping speech is not handled particularly well by whisper or whisperx.
  • Diarization is far from perfect.
  • A language-specific wav2vec2 model is needed.
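
A small sketch for spotting such unaligned words in the Python output; this assumes (as in recent versions) that each aligned segment exposes a "words" list whose entries may lack "start"/"end" when alignment failed:

# Hedged sketch: report words that received no timing after whisperx.align(...).
for seg in result["segments"]:
    for word in seg.get("words", []):
        if "start" not in word or "end" not in word:
            print("No timing for:", word.get("word"))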

Contribute 🧑‍🏫

If you are multilingual, a major way you can contribute to this project is to find phoneme models on Hugging Face (or train your own) and test them on speech for the target language. If the results look good, send a pull request with some examples showing its success.

Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

TODO 🗓

  • Multilingual init

  • Automatic align model selection based on language detection

  • Python usage

  • Incorporating speaker diarization

  • Model flush, for low gpu mem resources

  • Faster-whisper backend

  • Add max-line etc. see (openai's whisper utils.py)

  • Sentence-level segments (nltk toolbox)

  • Improve alignment logic

  • update examples with diarization and word highlighting

  • Subtitle .ass output <- bring this back (removed in v3)

  • Add benchmarking code (TEDLIUM for spd/WER & word segmentation)

  • Allow silero-vad as alternative VAD option

  • Improve diarization (word level). Harder than first thought...

Contact/Support 📇

Contact [email protected] for queries.

Buy Me A Coffee

Acknowledgements 🙏

This work, and my PhD, is supported by the VGG (Visual Geometry Group) and the University of Oxford.

Of course, this builds on OpenAI's whisper. It borrows important alignment code from the PyTorch tutorial on forced alignment and uses the wonderful pyannote VAD / diarization: https://github.com/pyannote/pyannote-audio

Valuable VAD & diarization models from pyannote-audio (https://github.com/pyannote/pyannote-audio).

Great backend from faster-whisper and CTranslate2

Those who have supported this work financially 🙏

Finally, thanks to the open-source contributors of this project, keeping it going and identifying bugs.

Citation

If you use this in your research, please cite the paper:
@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}

whisperx's People

Contributors

abcods, amolinasalazar, aramlang, arnavmehta7, awerks, barabazs, boulaouaney, canoalberto, carnifexer, cococig, davidas1, egorsmkv, invisprints, jim60105, jkukul, kurianbenoy, m-bain, mahmoudashraf97, marcuskbrandt, mlopsengr, remic33, smly, sohaibanwaar, sorgfresser, tengdahan, thebys, thomasmol, tijszwinkels, valentt, yasutak


whisperx's Issues

Timestamp issue

I noticed that when I transcribe videos the subtitles aren't displayed anymore. Apparently the start and end timestamps are much closer together than before. I noticed this for all my new transcriptions.

This is a comparison between a transcription I did with the same settings in December vs now
red: old transcription
green: new transcription

--- <unnamed>
+++ <unnamed>
@@ -1,55 +1,55 @@
 WEBVTT
 
-00:06.854 --> 00:09.218
+00:06.854 --> 00:06.874
 REDACTED
 
-00:10.747 --> 00:10.990
+00:10.747 --> 00:10.869
 REDACTED
 
-01:30.038 --> 01:31.039
+01:30.038 --> 01:30.098
 REDACTED
 
-01:32.560 --> 01:36.980
+01:32.560 --> 01:32.600
 REDACTED
 
-01:37.100 --> 01:37.685
+01:37.100 --> 01:37.141
 REDACTED
 
-01:39.860 --> 01:42.014
+01:39.860 --> 01:39.900
 REDACTED
 
-01:42.960 --> 01:43.124
+01:42.960 --> 01:43.062
 REDACTED
 
-01:44.538 --> 01:44.620
+01:44.538 --> 01:44.559
 REDACTED
 
-01:45.820 --> 01:47.458
+01:45.820 --> 01:45.861
 REDACTED
 
-01:47.680 --> 01:49.620
+01:47.680 --> 01:47.741
 REDACTED
 
-01:49.660 --> 01:51.518
+01:49.660 --> 01:49.761
 REDACTED
 
-01:54.140 --> 01:57.878
+01:54.140 --> 01:54.301
 REDACTED
 
-01:58.400 --> 02:03.240
+01:58.400 --> 01:58.541
 REDACTED
 
-02:18.761 --> 02:20.514
+02:18.761 --> 02:18.781
 REDACTED
 
-02:21.280 --> 02:23.316
+02:21.280 --> 02:21.381
 REDACTED
 
-02:25.300 --> 02:27.217
+02:25.300 --> 02:25.361
 REDACTED
 
-02:27.600 --> 02:32.379
+02:27.600 --> 02:27.620
 REDACTED
 
-02:32.840 --> 02:34.756
+02:32.840 --> 02:32.921
 REDACTED

pyannote.audio use guarded in code but not setup

pyannote.audio brings some great features, but it is rightfully guarded in code since it requires additional credentials. Additionally, it appears to be currently incompatible with the newest Macs, meaning the install fails when it is included. The recommendation here would be to make the additional features it brings an extra instead of part of the default install. Maybe just add:

#added to the setup() call in setup.py
#additionally, it should be removed from the requirements.txt
#I've never been good at naming things.
extras_require={"pyannote": ["pyannote.audio"]},

Alternatively, the code can remove it from the setup entirely and detect, and gracefully handle, the case where it isn't installed when VAD and diarization are called. This may be preferred, since decoupling the diarization/VAD implementation could be useful in the future when there are multiple valid options.
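
A sketch of that graceful-degradation idea (illustrative only, not the repository's actual code; the pipeline name is the standard pyannote one):

# Hypothetical optional-dependency guard for pyannote.audio.
try:
    from pyannote.audio import Pipeline
    PYANNOTE_AVAILABLE = True
except ImportError:
    PYANNOTE_AVAILABLE = False

def get_diarization_pipeline(hf_token):
    if not PYANNOTE_AVAILABLE:
        raise RuntimeError(
            "pyannote.audio is not installed; install the 'pyannote' extra to enable VAD/diarization."
        )
    return Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=hf_token)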

Explore the usage of multilingual models

I think we need to explore multilingual models such as wav2vec2-xls-r-300m-21-to-en to see if the 300M models are better than the 53M models currently used for low-resource languages, and to see if we could use a single model for multilingual alignment.

I have the code written but not thoroughly tested, so I might share it after the merging of #53, but I wanted to hear your thoughts about this first.

The result is different when executed multiple times

1. Command:
whisperx ./audio/diff-result-times.mp4 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --output_dir ./audio/ --language en --model small.en

2. Audio:

diff-result-times.mp4

3. Different results, for example:
[screenshot]

Phoneme-level timestamps

Great work!
Is it possible to expand this to phoneme-level timestamps, instead of word-level timestamps?

For example, instead of
"[00:13.50->00:13.60] smiles"
have
"[00:13.50->00:13.53] s
[00:13.53->00:13.57] mi
[00:13.57->00:13.58] l
[00:13.58->00:13.60] es"

how to output joined segments for a longer on-screen session?

The generated .ass is giving constant full segments with occasional word highlighting. How can I make it so the segments are joined for a longer on-screen session?

ty @m-bain, looking forward to a sweet built-in diarization implementation! This will also be very useful for quick preview of audio sound clips when searching.

Explanation of use-case:


Would like to see if the above question is easily achieved, because my eyes follow the auto-generated captions on youtube videos, the scrolling, like RSVP (https://en.wikipedia.org/wiki/Rapid_serial_visual_presentation), and I only watch videos for that reason. It is sort of a speed-reading-like tool, but instead of presenting one word at a time, which can be nauseating, a more relaxed method is possible, where you can look at old text.

These .ass files generated are possible to view with audio files on android 12 mpv, when renaming them to .srt.
A display that can hold more text than 2 lines (unlike youtube) would be ideal.

Edit: HFValidationError + more

I can use --diarize alone and it works, but if I add --vad_filter I'll get this error.

huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'pyannote\segmentation'.

Which doesn't make sense, since the hf_token is validated only when diarizing. Aaaaaaand it's here, while writing this, that I just realized that the segmentation model is used to improve the diarization and is another repository.

Also, a simple cross-talking answer like yes or no is often under 300 ms long. Pyannote only accurately detects a speaker change every 2 seconds of spoken audio, which is almost 7 times slower than what is needed if we want to use as little "logic" as possible, as I call it.

"Logic": Whisper almost always detects when a new sentence is needed, also if it's an answer to a question. So, using the timestamps of the spoken word(s), I rerun the diarization, but with empty space at the beginning, or, as I do, I double the length of the audio file and then divide the time. This will 90% of the time be treated as another speaker than the one who spoke before. But it still can't accurately detect who is talking and would just create a new speaker, e.g. Speaker_03, if only 2 people were talking. This means there are a lot of annoying but simple steps in listening to each unknown speaker and labelling them correctly. I would say that with my method the accuracy of guessing the speaker in the stretched audio is around 50%, so not the worst.

I also briefly checked out a few stats from Nvidia NeMo diarization/segmentation, and it seems to be a tiny bit better at handling switching speakers. But accuracy drops drastically when it's nearing the 0.5-1 second mark (40% I think I saw). Then again, maybe the same method of stretching the audio would help a bit, like it does with Pyannote. The bad thing with Nvidia NeMo is that it can only be trained with GPUs that support CUDA, or maybe can only run well on CUDA GPUs; I'm not sure and haven't tried, as I only have an RX 580 8GB and a Ryzen 2700x.

Also, a maximum segment length in characters would be nice, as when you get past 70 characters it begins to fill a lot of the subtitle floor. Like this port that uses --max-len "characters" to limit it very well.

Now I would not call myself a programmer, I just play around with things and cross my fingers that I made it work. I will look into it myself and maybe make a pull request if I can improve or implement something. 😃

Running on short audio: KeyError: 'level_1'

Failed to align segment: no characters in this segment found in model dictionary, resorting to original...
I was testing aligning an audio file, but it didn't work and gave the above error. It was a plain English .wav file.

Error after updating to pandas commit.

I keep getting this error after the new commit to Pandas:

File "//whisper.py", line 47, in whisper
result_aligned = whisperx.align(segments, model_a, metadata, audio_file, device)
File "/usr/local/lib/python3.9/site-packages/whisperx/alignment.py", line 309, in align
word_segments_arr["segment-text-start"] = per_word_grp["level_1"].min().reset_index()["level_1"]
File "/usr/local/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1416, in getitem
return super().getitem(key)
File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 248, in getitem
raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: level_1'

instructions on updating

You provide installation instructions. It would be nice to mention how to update in the same area of the readme so that people who know enough to follow the install line can also be given the information to update. (even if that is just to perform the same command again)

Working with the new VAD Feature

I'm currently trying to work with the new VAD feature but I'm getting the following error:

TypeError: transcribe_with_vad() missing 1 required positional argument: 'vad_pipeline'

Is there sample code anywhere for transcribing with vad?

Korean Model Issues with Alignment and Transcription

Terminal (Main) align_extend 2
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.360]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์ž…๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:03.360 --> 00:09.120]  ์•„๋ฌด๋„ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ์‚ฌ๋žŒ์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:09.120 --> 00:18.240]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด๋กœ ๋งํ•˜๋Š” ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•˜๋Š” ๊ฒฝ์šฐ์— ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:18.240 --> 00:23.160]  ์˜์–ด๋กœ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๊ฒƒ๊ณผ๋Š” ์ฐจ์ด๊ฐ€ ์—†์ฃ .
[00:23.160 --> 00:24.160]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:30.920 --> 01:34.580]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:34.580 --> 01:38.320]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:38.320 --> 01:41.000]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:41.000 --> 01:44.560]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:44.560 --> 01:45.600]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:49.600 --> 01:54.720]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ, ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:54.720 --> 01:59.680]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:24.720 --> 02:32.720]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:32.720 --> 02:37.320]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:43.920]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:47.020 --> 02:49.020]  TALK TO ME IN KOREAN ์—์„œ ๋งŒ๋‚˜์š”!
[02:49.020 --> 02:49.520]  Bye!
Performing alignment...
[00:00.000 --> 00:00.502]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์ž…๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:01.360 --> 00:02.021]  ์•„๋ฌด๋„ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ์‚ฌ๋žŒ์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:07.120 --> 00:08.282]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด๋กœ ๋งํ•˜๋Š” ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•˜๋Š” ๊ฒฝ์šฐ์— ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:16.240 --> 00:16.821]  ์˜์–ด๋กœ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๊ฒƒ๊ณผ๋Š” ์ฐจ์ด๊ฐ€ ์—†์ฃ .
[00:21.160 --> 00:21.260]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:28.920 --> 01:29.522]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:32.580 --> 01:33.322]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:36.320 --> 01:36.781]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:39.000 --> 01:39.501]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:42.560 --> 01:42.640]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:47.600 --> 01:48.161]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ, ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:52.720 --> 01:53.321]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:22.720 --> 02:23.581]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:30.720 --> 02:30.840]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:45.020 --> 02:45.160]  TALK TO ME IN KOREAN ์—์„œ ๋งŒ๋‚˜์š”!
[02:49.020 --> 02:49.520]  Bye!

E:\Applications\WhisperX>
Terminal (Not described below) No align_extend parameter
E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.320]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์—์š”. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:03.320 --> 00:09.060]  ์•„๋ฌด๋„ ๋ˆ„๊ตฐ๊ฐ€์—๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:09.060 --> 00:18.280]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด ๊ฐ™์€ ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•  ๋•Œ ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:18.280 --> 00:23.100]  ์˜์–ด์—์„œ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋ง์˜ ์ฐจ์ด๊ฐ€ ์ ์Šต๋‹ˆ๋‹ค.
[00:23.100 --> 00:24.200]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:52.480 --> 00:58.720]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:03.220 --> 01:08.440]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:30.840 --> 01:34.580]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:34.580 --> 01:38.280]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:38.280 --> 01:40.920]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:40.920 --> 01:44.520]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:44.520 --> 01:45.880]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:49.880 --> 01:54.920]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:54.920 --> 01:59.720]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:24.920 --> 02:32.920]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:32.920 --> 02:37.360]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:44.000]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:47.000 --> 02:49.000]  TalkToMeInKorean์—์„œ ๋งŒ๋‚˜์š”!
[02:49.000 --> 02:49.500]  Bye!
Performing alignment...
[00:00.000 --> 00:00.482]  ์•ˆ๋…•ํ•˜์„ธ์š”. ์ €๋Š” ํ˜„์šฐ์—์š”. ํ•œ๊ตญ์–ด๋กœ ๋งํ•ด์ฃผ์„ธ์š”.
[00:01.340 --> 00:02.062]  ์•„๋ฌด๋„ ๋ˆ„๊ตฐ๊ฐ€์—๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์„ ๋“ฃ๊ณ  ์‹ถ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
[00:07.060 --> 00:08.102]  ๊ทธ๋Ÿฌ๋‚˜, ์ด๋Ÿฐ ๊ธธ๊ฒŒ ๋งํ•˜๋Š” ๋ง์€ ํ•œ๊ตญ์–ด ๊ฐ™์€ ์ƒˆ๋กœ์šด ์–ธ์–ด๋ฅผ ๊ณต๋ถ€ํ•  ๋•Œ ์ •๋ง ๋„์›€์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
[00:16.280 --> 00:16.801]  ์˜์–ด์—์„œ๋Š” ์งง์€ ๋ง๊ณผ ๊ธธ๊ฒŒ ๋ง์˜ ์ฐจ์ด๊ฐ€ ์ ์Šต๋‹ˆ๋‹ค.
[00:21.100 --> 00:21.220]  ์˜ˆ๋ฅผ ๋“ค์–ด,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:01.220 --> 01:01.821]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:28.840 --> 01:29.442]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ  ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์ง€๋งŒ
[01:32.580 --> 01:33.322]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํ”Œ ๋•Œ๋Š” ๋„ˆ๋ฅผ ์žก์•„๋จน์ง€ ์•Š์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ฒ ๋‹ค๋Š” ์•ฝ์†์€
[01:36.280 --> 01:36.761]  ์ง€๊ธˆ์€ ์ผ๋‹จ ํ•ด์ค„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ
[01:38.920 --> 01:39.441]  100% ๋ณด์žฅํ•  ์ˆ˜๋Š” ์—†๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•ด์คฌ์œผ๋ฉด ์ข‹๊ฒ ๋Š”๋ฐ
[01:42.520 --> 01:42.600]  ๊ฐ€๋Šฅํ• ๊นŒ?
[01:47.880 --> 01:48.481]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[01:52.920 --> 01:53.501]  ๋‚ด๊ฐ€ ๋ฐฐ๊ฐ€ ์•ˆ ๊ณ ํŒŒ. ๊ทธ๋Ÿฌ๋ฉด ๋„ˆ๋ฅผ ์•ˆ ์žก์•„๋จน์–ด. ๋…ธ๋ ฅํ• ๊ฒŒ.
[02:22.920 --> 02:23.721]  ์žฌ๋ฏธ์žˆ๋Š” ์ฑ…, ํ•œ๊ตญ์–ด ๊ณต๋ถ€๋ฅผ ์ข€ ํ•˜๋ ค๊ณ  ์–ด๋””๋กœ ๊ฐ€๋ฉด ์ข‹์„์ง€ ์•„์ง ๋ชจ๋ฅด๊ฒ ์–ด์š”.
[02:30.920 --> 02:31.040]  ๋งˆ์นจ and ์šฐ์—ฐํžˆ are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  ์ง‘์—๋งŒ ์žˆ์„ ์ƒ๊ฐ์ด์—ˆ์ง€๋งŒ.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:45.000 --> 02:45.120]  TalkToMeInKorean์—์„œ ๋งŒ๋‚˜์š”!
[02:49.000 --> 02:49.500]  Bye!

E:\Applications\WhisperX>

Environment

OS: Windows 10
Python: 3.9.9
WhisperX: e909f2f
Whisper Model: Large
Alignment Model: w11wo/wav2vec2-xls-r-300m-korean

Input

https://www.youtube.com/watch?v=GM2Ki_FmF5U (720p version, audio + video, mp4, I let WhisperX preprocess it)

WhisperX Command

whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2

Issues

Note: Everything below is described using align_extend 2 as shown in the command above and the Terminal (Main) details above

Translating English to Korean when I don't want it to

In the input, the speaker uses a mix of English and Korean. In the video's introduction, English is spoken and later in the video, such as during the example sentences, he switches to Korean. Instead of returning the introduction in English, you can see that it instead translated the English sentences into Korean for some reason.

This behavior is inconsistent too. For example, at 0:00:24, English is spoken and WhisperX transcribes it in English. That is fine. However, as mentioned above, during the introduction it transcribed it from English into Korean. I have no clue why that is.

Alignment Issues

EDIT: SOLVED. I remembered that in #7 there was mention of having to use Chinese instead of cn, so I took a look at the issue again and saw it was regarding alignment. I changed kr to Korean for the language parameter and this issue was resolved.

At first I tried passing in no align_extend parameter. That made the transcribed captions even worse. I then used align_extend 2 as given in the Japanese example in the README which improved my results. It made everything better, and the English portions are lined up. However, the issue is the first occurrence (0:00:52) of the following line below is not aligned properly at all:

๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

The transcription is correct. The issue is the alignment. As you can see from the output of the command line, initially everything is correct before it performs the alignment

[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

However, after the alignment, it seems like all of the Korean sentences' alignments are less than a second long. Here is just one example, but if you look at the output above, you will notice it's the case for all Korean sentences post-alignment.

[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  ๋‚˜๋Š” ์‚ฌ์ž์•ผ. ๋„ˆ๋Š” ํ† ๋ผ์•ผ. ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  ๋‚˜๋Š” ์‚ฌ์ž๊ณ , ๋„ˆ๋Š” ํ† ๋ผ๋‹ˆ๊นŒ, ์šฐ๋ฆฌ๋Š” ์นœ๊ตฌ๊ฐ€ ๋  ์ˆ˜ ์—†์–ด.

Support inference from fine-tuned ๐Ÿค— transformers Whisper models

Hi @m-bain,

This is a very cool repository and definitely useful for getting more reliable and accurate timestamps for the generated transcriptions.
I was wondering if you'd like to extend the current transcription codebase to also support transformers fine-tuned Whisper checkpoints as well.

For context, we recently ran a Whisper fine-tuning event powered by 🤗 transformers, and over the course of the event we managed to fine-tune 650+ Whisper checkpoints across 112 languages. You can find the leaderboard here: https://huggingface.co/spaces/whisper-event/leaderboard

In almost all cases the fine-tuned models beat the original Whisper model's zero-shot performance by a huge margin.

I think it'll be of huge benefit for the community to be able to utilise these models with your repo. Happy to support you if you have any questions from the 🤗 transformers side. :)

Cheers,
VB

missing pip install soundfile

Thank you for your work, but unless one also does 'pip install soundfile' during the installation, they will get an error during initialization of WhisperX.

I would suggest adding that step to the readme as even though it doesn't stop operation it does throw a warning.

Add option to only align text to audio

It would be nice if there was a flag to pass an .srt or other text file directly to the alignment, skipping the speech recognition.
E.g. sometimes I already have a ground-truth transcription and I only want to align it to the audio.
Is this possible?
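
There is no such flag at the moment, but a rough sketch via the Python API (assuming whisperx.align only needs "start", "end" and "text" on each segment, which may vary by version) would be:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Hypothetical ground-truth transcript, e.g. parsed from an existing .srt file.
segments = [
    {"start": 0.0, "end": 4.2, "text": "Hello and welcome to the show."},
    {"start": 4.2, "end": 7.9, "text": "Today we talk about forced alignment."},
]

model_a, metadata = whisperx.load_align_model(language_code="en", device=device)
aligned = whisperx.align(segments, model_a, metadata, audio, device)
print(aligned["segments"])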

Use of fine-tuned Whisper

What a great tool! Is it somehow possible to also use a version of Whisper that has been fine-tuned? I have one model trained with transformers on Hugging Face.
Thanks.

"word-level" doesn't always exist

I tried to translate from Japanese to English and use whisperx.
In exactly one entry, "word-level" is missing from its alignment dict, causing utils.write_ass to fail.

Alignment output entry:

    {
        "id": 654,
        "seek": 518644,
        "start": 5198.74,
        "end": 5198.7404,
        "text": " Shizu",
        "tokens": [
            1160,
            590,
            84
        ],
        "temperature": 1,
        "avg_logprob": -4.464975124452172,
        "compression_ratio": 1.0123456790123457,
        "no_speech_prob": 0.07407991588115692
    },

Can't process MP3 files with VAD or diarization

Hi @m-bain. Thank you so much for your amazing work!

I wanted to test your new VAD feature, but I get an error stating that:

Traceback (most recent call last):
  File "/usr/local/bin/whisperx", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 451, in cli
    result = transcribe_with_vad(model, audio_path, vad_pipeline, temperature=temperature, **args)
  File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 310, in transcribe_with_vad
    vad_segments = vad_pipeline(audio)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/pipeline.py", line 238, in __call__
    return self.apply(file, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/pipelines/voice_activity_detection.py", line 197, in apply
    segmentations: SlidingWindowFeature = self._segmentation(file)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/inference.py", line 328, in __call__
    waveform, sample_rate = self.model.audio(file)
  File "/usr/local/lib/python3.8/dist-packages/pyannote/audio/core/io.py", line 278, in __call__
    waveform, sample_rate = torchaudio.load(file["audio"])
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/backend/soundfile_backend.py", line 205, in load
    with soundfile.SoundFile(filepath, "r") as file_:
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "/usr/local/lib/python3.8/dist-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'xxxx.mp3': File contains data in an unknown format.

I've tried with different (valid) mp3 files and each time it results in this error.

I'm using whisperx on Google Colab.

// EDIT: I get the same error when I try the diarization feature.

Error when trying to use the pt align model.

I saw that support for portuguese was added a few commits ago and decided to give it a go. But when loading the align model this error happens:

ValueError: The chosen align_model "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese" could not be found in huggingface (https://huggingface.co/models) or torchaudio (https://pytorch.org/audio/stable/pipelines.html#id14)

How to pass in traditional whisper CLI parameters?

I cannot find this anywhere in the documentation. In the whisperx transcribe function there is a massive section of optional parameters that can be passed in. How can I actually use these in python?

    # parser.add_argument("--model", default="small", choices=available_models(), help="name of the Whisper model to use")
    # parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
    # parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
    # # alignment params
    # parser.add_argument("--align_model", default=None, help="Name of phoneme-level ASR model to do alignment")
    # parser.add_argument("--align_extend", default=2, type=float, help="Seconds before and after to extend the whisper segments for alignment")
    # parser.add_argument("--align_from_prev", default=True, type=bool, help="Whether to clip the alignment start time of current segment to the end time of the last aligned word of the previous segment")
    # parser.add_argument("--drop_non_aligned", action="store_true", help="For word .srt, whether to drop non aliged words, or merge them into neighbouring.")

    # parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
    # parser.add_argument("--output_type", default="srt", choices=['all', 'srt', 'vtt', 'txt'], help="File type for desired output save")

    # parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")

    # parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate')")
    # parser.add_argument("--language", type=str, default=None, choices=sorted(LANGUAGES.keys()) + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]), help="language spoken in the audio, specify None to perform language detection")

    # parser.add_argument("--temperature", type=float, default=0, help="temperature to use for sampling")
    # parser.add_argument("--best_of", type=optional_int, default=5, help="number of candidates when sampling with non-zero temperature")
    # parser.add_argument("--beam_size", type=optional_int, default=5, help="number of beams in beam search, only applicable when temperature is zero")
    # parser.add_argument("--patience", type=float, default=None, help="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search")
    # parser.add_argument("--length_penalty", type=float, default=None, help="optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default")

    # parser.add_argument("--suppress_tokens", type=str, default="-1", help="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations")
    # parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")
    # parser.add_argument("--condition_on_previous_text", type=str2bool, default=False, help="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop")
    # parser.add_argument("--fp16", type=str2bool, default=True, help="whether to perform inference in fp16; True by default")

    # parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
    # parser.add_argument("--compression_ratio_threshold", type=optional_float, default=2.4, help="if the gzip compression ratio is higher than this value, treat the decoding as failed")
    # parser.add_argument("--logprob_threshold", type=optional_float, default=-1.0, help="if the average log probability is lower than this value, treat the decoding as failed")
    # parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")
    # parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")

Where would I actually put this? Transcribe does not seem to have an input parameter for these, neither does load_model.

    model = whisperx.load_model(modelSize, device)
    result = model.transcribe(audio_file)
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

Specifically interested in --threads, --beam_size, --patience, and --best_of

Error when adding torch, and also how to add --diarize/parameters when using python

I have cloned WhisperX and I'm getting an error about the torch 1.8 requirement. It seems like there is a syntax error at "torch (>=1.8.*)" when run; it should probably be replaced with "torch >=1.8.0", or "torch >=1.8,<1.9" if torch 1.9 also works.

epkg_resources.extern.packaging.requirements.InvalidRequirement: Expected closing RIGHT_PARENTHESIS - torch (>=1.8.*)

I'm also interested in learning how to add parameters in python.

Index out-of-range access to word_segment_list

@m-bain Thank you for whisperX!

In some audio, I found that an IndexError: list index out of range error happens at the aligning stage.

Traceback (most recent call last):                                                                            
  File "/home/syoyo/miniconda3/envs/whisperx/bin/whisperx", line 8, in <module>                               
    sys.exit(cli())                                                                                           
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 505, in
 cli                                                                                                          
    result_aligned = align(result["segments"], align_model, align_metadata, audio_path, device,              
  File "/home/syoyo/miniconda3/envs/whisperx/lib/python3.8/site-packages/whisperx/transcribe.py", line 374, in
 align                                                                                                        
    word_segments_list[-1]['text'] += ' ' + curr_word                                                        
IndexError: list index out of range 

How to reproduce

https://commonvoice.mozilla.org/en/datasets

Download Japanese -> Common Voice Corpus 12.0

[screenshot]

$ whisperx --model large --language ja cv-corpus-12.0-2022-12-07/ja/clips/common_voice_ja_35797612.mp3

Investigation

word_segments_list[-1]['text'] += ' ' + curr_word

Here is the dump of t_local and t_words before for x in range(len(t_local)):

t_local = [None, None, (0.842688, 1.083456), (1.424544, 1.5048), (1.524864, 1.544928), (1.665312, 1.7455679999999998), (1.7656319999999999, 1.785696), (1.865952, 2.026464), (2.046528, 2.186976), (2.20704, 2.2271039999999998), (2.247168, 2.327424), (2.347488, 2.367552), (2.427744, 2.508), (2.548128, 2.568192), (2.688576, 2.969472), None, (3.089856, 3.10992), (3.129984, 3.2102399999999998), (3.31056, 3.8322239999999996), (3.852288, 3.9325439999999996), (4.052928, 4.072992), (4.133184, 4.2535680000000005), (4.333824, 4.434144), (4.494336, 4.715039999999999), (4.735104, 4.755168), (4.775232, 5.196576), (5.31696, 5.457407999999999), (5.597856, 5.678112), (5.738303999999999, 5.81856), (5.838624, 5.898815999999999), (5.9790719999999995, 6.0793919999999995), (6.159648, 6.199776), (6.21984, 6.280032), (6.360288, 6.460608000000001), (6.480671999999999, 6.5207999999999995), (6.641183999999999, 6.761568), (6.781632, 6.801696), None]
t_words = ['็ซฅ', '่ฒž', 'ๅŠฉ', 'ใ‹', 'ใ‚‰', 'ใช', 'ใ„', 'ใจ', 'ๆ€', 'ใฃ', 'ใฆ', 'ใ„', 'ใ‚‹', 'ใ‹', 'ใ‚‰', 'ใ€', 'ใ„', 'ใ‚‹', 'ใจ', 'ใ†', 'ใฃ', 'ใจ', 'ใ•', 'ใ‚Š', 'ใฃ', 'ใจ', '้Ÿณ', 'ใŒ', 'ใ—', 'ใฆ', '็›ฎ', 'ใ‹', 'ใ‚‰', '็ซ', 'ใŒ', 'ๅ‡บ', 'ใŸ', 'ใ€‚']

Probably we need to consider a situation where t_local may contain multiple Nones from the start? (A sketch of one possible guard follows.)
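
One possible guard, sketched against the snippet above (not the maintainer's actual fix; fields other than 'text' are omitted for brevity):

# Only merge into the previous word segment if one exists;
# otherwise start a new entry (handles leading Nones in t_local).
if word_segments_list:
    word_segments_list[-1]['text'] += ' ' + curr_word
else:
    word_segments_list.append({'text': curr_word})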

word level confidences

Hi,

I was trying out your package, it seems like a pretty useful addition to whisper.

I was wondering if you had any plans to add word level confidences / logprobs, eg similar to this ticket?
openai/whisper#284

Thanks!

Needed steps to add a new language

Hi, do I just find a compatible Wav2Vec2 model for the language and add it to the models list or any other steps needed? also is there are any modifications needed for RTL languages such as Arabic?

Failed to align

Thank you for this repo first!
When I tried to align an audio file, I met this error.

File "/home/ubuntu/.conda/envs/whisper/lib/python3.9/site-packages/whisperx/alignment.py", line 67, in backtrack
    raise ValueError("Failed to align")
ValueError: Failed to align

Any idea how to fix this?

OOM && Large GPU usage

Hi, I am using this, but it doesn't seem to release the GPU VRAM. I am using the large-v2 model and it's using 16 GB of GPU RAM, which I don't think is normal at all; moreover, it doesn't free up the RAM afterwards. Is this normal?
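
The README's Python example hints at the intended workaround between stages; a sketch of explicitly releasing the model:

import gc
import torch

# After transcription (or alignment), drop the reference and free cached GPU memory.
del model
gc.collect()
torch.cuda.empty_cache()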

Does it support Chinese?

Hi,

I couldn't find align model used for Chinese in "torchaudio.pipelines". Does it support Chinese?

Thanks

Length of the written text

If someone speaks for too long, the whole text will appear across the entire screen at that timeframe, so is there any option to cut this into pieces?

Got IndexError: list index out of range

At the 'Performing alignment...' step I got an IndexError: list index out of range error
whisperx --model large --task translate --language ja movie.mp4 --align_model jonatasgrosman/wav2vec2-large-xlsr-53-japanese

[screenshot: cmd_mi3f2FHNZm]

amazing project

hello team, thank you for this project!

Just wondering, is there a possibility to incorporate whisper mic for real-time processing?

Low Accuracy on German

I wanted to use WhisperX to do forced alignment on the Mozilla Common Voice German dataset, but the words are often cut off or the segments do not align at all.

Additionally, some audio tracks are recognized as Farsi instead of German.

Is it because of the short duration of these clips (< 2-5 seconds, each)?
And how can I improve this accuracy?

Is the accuracy of the english models (for english audio) better?

Align Models and Fine-Tuning

I'm trying to use whisperX on the Korean language and came across some issues on how to do so. Since there's no default model, I went over to HF to find a model to use. As expected, there are many models and I'm not too sure which one to choose, especially because some of the recent models trained by slplab are lacking evaluation results. Also, not really sure what they are evaluated on or any other models. Personally I am not familiar with wav2vec2 or the evaluation benchmark it uses let alone what these other trained models are benchmarked against.

Let me get back on topic: does it matter if the model I select was fine-tuned on wav2vec2-large-xlsr-53? I'm asking because the current default Japanese model was fine-tuned on it and is now the default. For Chinese, another model fine-tuned on wav2vec2-large-xlsr-53 was again selected according to #7.

Do I have to choose a model like slplab/wav2vec2-large-xlsr-53-korean-samsung-54k or could I use thisisHJLee/wav2vec2-large-xls-r-1b-korean-sample5?

Still some incoherent timestamps in the srt file

@m-bain Thank you so much for your amazing work.
There are still some incoherent timestamps in the word-level srt files (it was the case for 139 files out of 360 in my data). I'm about to write a Python script to parse all the srt files and fix the affected timestamps, but maybe there is a way to avoid them from the beginning? It makes it hard to convert them into TextGrid files... (I use https://github.com/rctatman/SrtToTextgrid ). Besides that, WhisperX is working so well!!

Diarized transcription in .txt file?

The diarize version just puts the transcript in an .ass file atm as far as I can see - is it possible in the current version to get the diarized output in a .txt file? (A sketch follows below.)

Been trying this out with KBLab/wav2vec2-large-voxrex-swedish for Swedish, and while I lack the hardware for extensive testing atm, it seems to be working fine.
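
A minimal sketch for dumping a speaker-labelled transcript to a .txt file, assuming (as in the Python example above) that whisperx.assign_word_speakers leaves a "speaker" key on each segment:

# Write "SPEAKER: text" lines from the diarized result (sketch only).
with open("transcript.txt", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        f.write(f"{speaker}: {seg['text'].strip()}\n")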

Suggested enhancement: increase length of resulting segments

I assume that the resulting segments resemble the black lines in the image. In the image, the segments have been lengthened at the beginning (red line), and at the end (green line).

Is it possible to add parameters for the optional lengthening of all the resulting segments at the beginning and at the end?

[parameter 1:] The red line is an example of the result of a parameter to lengthen all segments at the beginning (in milliseconds).

[parameter 2:] The green line is an example of the result of a parameter to lengthen all segments at the end (in milliseconds).

[parameter 3:] A third parameter is for the minimum distance (in milliseconds) between the end of a (lengthened) segment and the beginning of the next (lengthened) segment. This third parameter may overrule the other two parameters to prevent a collision.
