The timestamps I am getting from whisperX are way off (we are talking about 10-15 seco

My timestamps with whisperX are way off (m-bain/whisperx)

Lauler commented on July 16, 2024

@tophee I have used WhisperX with KBLab/wav2vec2-large-voxrex-swedish as an alignment model without any issues.

When you open an issue, please try to:

- Include a reproducible example (code + short audio clip that reproduces your error).
- Include which versions of relevant libraries you have installed in your environment.
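
One way to collect the versions asked for above is sketched below. It only uses the standard library (importlib.metadata, Python 3.8+); the helper name report_versions is hypothetical and not part of whisperx.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package name: installed version, or 'not installed'}."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from the current environment.
            report[name] = "not installed"
    return report


# Print one line per package, ready to paste into an issue.
for name, ver in report_versions(["whisperx", "torch", "transformers"]).items():
    print(f"{name:<24}{ver}")
```

Pasting this output next to the code and audio clip makes it much easier for others to reproduce the problem.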

For me, the following code snippet produces high quality alignments:

import whisperx

device = "cuda"
audio_file = "data/1991/RD_EN_L_1991-04-09_1991-04-10.1.mp3"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

model = whisperx.load_model("large-v2", device, compute_type=compute_type, language="sv")
audio = whisperx.load_audio(audio_file, sr=16000)

result = model.transcribe(audio, batch_size=batch_size)
model_a, metadata = whisperx.load_align_model(
    model_name="KBLab/wav2vec2-large-voxrex-swedish", device=device, language_code="sv"
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

Versions of libraries:

whisperx                3.1.1
torch                   2.0.0
transformers            4.31.0

I don't think there's any issue with the alignment model. But it's not possible to help you without details about your code and environment.

Edit: One source of error with mp3 files is that certain media players decode the duration incorrectly. For example, I would avoid the default media player in Ubuntu (the Videos app) when checking alignments, as timestamps can be unreliable in that app. Use something like VLC instead.
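
A player-independent sanity check is to compare the segment timestamps against the duration of the decoded audio. The sketch below assumes the audio array was loaded at 16 kHz (as with sr=16000 in the snippet above) and that segments are dicts with "start" and "end" keys in seconds; the helper names and the 0.5 s tolerance are hypothetical choices.

```python
SAMPLE_RATE = 16000  # matches sr=16000 passed to whisperx.load_audio above


def audio_duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration of a mono audio array with num_samples samples."""
    return num_samples / sample_rate


def segments_outside_audio(segments, duration, tolerance=0.5):
    """Return segments that start before 0 or end after the audio does."""
    return [
        seg for seg in segments
        if seg["start"] < 0 or seg["end"] > duration + tolerance
    ]


# Usage with real data (assumes `audio` and `result` from the snippet above):
# duration = audio_duration_seconds(len(audio))
# print(segments_outside_audio(result["segments"], duration))
```

If this returns an empty list but the player still shows a large offset, the player's duration decoding is a more likely culprit than the alignment.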

from whisperx.

SeeknnDestroy commented on July 16, 2024

Try setting condition_on_previous_text=False if you previously had it set to True. Let me know if that works.

tophee commented on July 16, 2024

I had that set to False from the outset.

SeeknnDestroy commented on July 16, 2024

Did you test with both no_align=True and no_align=False?

tophee commented on July 16, 2024

I'm not aware of that setting. Is it available via Python? What does it do?

SeeknnDestroy commented on July 16, 2024

Of course. An alignment model can greatly improve your timestamp accuracy.

import whisperx
import gc 

device = "cuda" 
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources (drop the reference first, then free memory)
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment
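
To see how far alignment moves the timestamps, the two printed segment lists can be compared numerically. This is a minimal sketch, assuming you keep the pre-alignment segments in a separate variable (the snippet above overwrites result) and that each segment is a dict with "start" and "end" in seconds. Because alignment may split segments differently, it compares only the overall span rather than pairing segments by index; span and span_shift are hypothetical helper names.

```python
def span(segments):
    """Return (earliest start, latest end) over a list of segment dicts."""
    starts = [s["start"] for s in segments if s.get("start") is not None]
    ends = [s["end"] for s in segments if s.get("end") is not None]
    return min(starts), max(ends)


def span_shift(before, after):
    """Return (shift of first start, shift of last end) in seconds."""
    b_start, b_end = span(before)
    a_start, a_end = span(after)
    return a_start - b_start, a_end - b_end


# Usage (assumes `segments_before` was saved before calling whisperx.align):
# print(span_shift(segments_before, result["segments"]))
```

A shift of many seconds at either end suggests the alignment step, not the transcription, is where the timestamps go wrong.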

tophee commented on July 16, 2024

Oh, that's what you mean by no_align=True and no_align=False.

Of course I am using the alignment model. That's what I meant by wav2vec2 model in the OP.

So you are suggesting I should turn it off?

SeeknnDestroy commented on July 16, 2024

I genuinely don't know if it helps, but why not give it a try? Maybe the alignment model for your target language is not working properly.

nkilm commented on July 16, 2024

Try specifying a different pre-trained model for alignment; maybe that's the issue. You can use this script, a wrapper on top of whisperx that lets you customise the pre-trained models and run in offline mode.

https://github.com/nkilm/offline-whisperx

tophee commented on July 16, 2024

OK, so you are confirming that it’s the alignment model that is causing the problem. Thing is: it’s not so easy to find one for Swedish.

Thanks for the link to the script. I’m not sure I understand the offline part, though. Isn’t whisperX offline by default?

I’ll have to take a closer look to figure out the role of the wespeaker model. Does it replace the wav2vec2 model?

nkilm commented on July 16, 2024

> OK, so you are confirming that it’s the alignment model that is causing the problem. Thing is: it’s not so easy to find one for Swedish.
>
> Thanks for the link to the script. I’m not sure I understand the offline part, though. Isn’t whisperX offline by default?
>
> I’ll have to take a closer look to figure out the role of the wespeaker model. Does it replace the wav2vec2 model?

I'm not an expert in ASR tasks, but since you've mentioned that the timestamps are way off, it could be due to some issue with the alignment model.

Well, whisperx isn't completely offline. You have to provide the Hugging Face token initially to download the models when you run it for the first time. If you are running in a restricted network where you can't download any models/executables from external sources, this is not possible with the current version of whisperx; that's where this script will help you. Related issue - #263
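
Once the models are in the local cache, the Hugging Face libraries can be told not to contact the network via environment variables. The sketch below sets HF_HUB_OFFLINE (huggingface_hub) and TRANSFORMERS_OFFLINE (transformers); whether every whisperx code path honours them is an assumption I have not verified, and enable_hf_offline is a hypothetical helper name.

```python
import os


def enable_hf_offline():
    """Ask the Hugging Face libraries to use only locally cached files."""
    # Set these before importing whisperx/transformers, since the values
    # are typically read when those libraries are imported.
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"


enable_hf_offline()
# import whisperx  # import only after the variables are set
```

If a model is missing from the cache, loading should then fail with an error instead of attempting a download.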

The wespeaker model is for speaker embedding, which is used in the diarization task. For alignment, whisperx uses wav2vec2.

nkilm commented on July 16, 2024

@tophee I just went through the code in whisperx; it doesn't have a default alignment model set up for Swedish.

whisperX/whisperx/alignment.py

Lines 24 to 58 in f2da2f8

    
    DEFAULT_ALIGN_MODELS_TORCH = {
        "en": "WAV2VEC2_ASR_BASE_960H",
        "fr": "VOXPOPULI_ASR_BASE_10K_FR",
        "de": "VOXPOPULI_ASR_BASE_10K_DE",
        "es": "VOXPOPULI_ASR_BASE_10K_ES",
        "it": "VOXPOPULI_ASR_BASE_10K_IT",
    }

    DEFAULT_ALIGN_MODELS_HF = {
        "ja": "jonatasgrosman/wav2vec2-large-xlsr-53-japanese",
        "zh": "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn",
        "nl": "jonatasgrosman/wav2vec2-large-xlsr-53-dutch",
        "uk": "Yehor/wav2vec2-xls-r-300m-uk-with-small-lm",
        "pt": "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese",
        "ar": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
        "cs": "comodoro/wav2vec2-xls-r-300m-cs-250",
        "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian",
        "pl": "jonatasgrosman/wav2vec2-large-xlsr-53-polish",
        "hu": "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian",
        "fi": "jonatasgrosman/wav2vec2-large-xlsr-53-finnish",
        "fa": "jonatasgrosman/wav2vec2-large-xlsr-53-persian",
        "el": "jonatasgrosman/wav2vec2-large-xlsr-53-greek",
        "tr": "mpoyraz/wav2vec2-xls-r-300m-cv7-turkish",
        "da": "saattrupdan/wav2vec2-xls-r-300m-ftspeech",
        "he": "imvladikon/wav2vec2-xls-r-300m-hebrew",
        "vi": 'nguyenvulebinh/wav2vec2-base-vi',
        "ko": "kresnik/wav2vec2-large-xlsr-korean",
        "ur": "kingabzpro/wav2vec2-large-xls-r-300m-Urdu",
        "te": "anuragshas/wav2vec2-large-xlsr-53-telugu",
        "hi": "theainerd/Wav2Vec2-large-xlsr-hindi",
        "ca": "softcatala/wav2vec2-large-xlsr-catala",
        "ml": "gvs/wav2vec2-large-xlsr-malayalam",
        "no": "NbAiLab/nb-wav2vec2-1b-bokmaal",
        "nn": "NbAiLab/nb-wav2vec2-300m-nynorsk",
    }

You can manually download the model KBLab/wav2vec2-large-voxrex-swedish and use https://github.com/nkilm/offline-whisperx for transcription, diarization, alignment, VAD etc.
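
The lookup logic implied here can be sketched as follows: use a default model when the language has one, otherwise fall back to a model you choose yourself. The dictionaries below are a short excerpt plus a hypothetical user-maintained override table, and pick_align_model is an illustrative helper, not a whisperx function.

```python
# Excerpt of the defaults quoted in this thread; the real tables are longer.
DEFAULT_ALIGN_MODELS = {
    "en": "WAV2VEC2_ASR_BASE_960H",
    "da": "saattrupdan/wav2vec2-xls-r-300m-ftspeech",
    "no": "NbAiLab/nb-wav2vec2-1b-bokmaal",
}

# Hypothetical overrides for languages without a built-in default.
CUSTOM_ALIGN_MODELS = {
    "sv": "KBLab/wav2vec2-large-voxrex-swedish",
}


def pick_align_model(language_code):
    """Return the alignment model name for a language, preferring overrides."""
    if language_code in CUSTOM_ALIGN_MODELS:
        return CUSTOM_ALIGN_MODELS[language_code]
    if language_code in DEFAULT_ALIGN_MODELS:
        return DEFAULT_ALIGN_MODELS[language_code]
    raise ValueError(f"No alignment model known for language {language_code!r}")


print(pick_align_model("sv"))
```

The returned name can then be passed as model_name to whisperx.load_align_model, as in the snippets elsewhere in this thread.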

tophee commented on July 16, 2024

Yes, there is no "built-in" alignment model for Swedish. That's why I'm using KBLab/wav2vec2-large-voxrex-swedish.

This is my alignment code:

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device, model_name="KBLab/wav2vec2-large-voxrex-swedish")
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

> whisperx isn't offline completely. You'll have to provide the huggingface token initially to download the models when you run for the first time.

Oh, I see. I'm just reading the model from the cache (model_dir = "/Users/xhxxch/.cache/whisper/") but you are right. I needed internet when I ran it for the first time.

> The wespeaker model is for Speaker Embedding which is used in Diarization task. For alignment wav2vec2 is used in whisperx.

All the wespeaker models are pretrained in English or Chinese. Do you know whether these will work with other languages?

nkilm commented on July 16, 2024

> Yes, there is no "built-in" alignment model for Swedish. That's why I'm using KBLab/wav2vec2-large-voxrex-swedish.
>
> This is my alignment code:
>
>     model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device, model_name="KBLab/wav2vec2-large-voxrex-swedish")
>     result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

What language code is being selected here for Swedish? The language is detected from the first 30 s of audio, and the probability score is displayed. Based on the detected language, the corresponding alignment model is picked.
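
A quick way to answer that question is to inspect the language stored in the transcription result before alignment. The sketch below assumes the result is a dict with a "language" key, as in the snippets in this thread; check_detected_language is a hypothetical helper.

```python
def check_detected_language(result, expected):
    """Return (matches, detected) for the language in a transcription result."""
    detected = result.get("language")
    return detected == expected, detected


# Usage (assumes `result` comes from model.transcribe as shown earlier):
# ok, detected = check_detected_language(result, "sv")
# if not ok:
#     print(f"Expected 'sv' but Whisper reported {detected!r}")
```

If the detected code is not the expected one, passing the language explicitly when loading the model (as with language="sv" earlier in the thread) avoids relying on detection.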

> > whisperx isn't offline completely. You'll have to provide the huggingface token initially to download the models when you run for the first time.
>
> Oh, I see. I'm just reading the model from the cache (model_dir = "/Users/xhxxch/.cache/whisper/") but you are right. I needed internet when I ran it for the first time.

Yeah, we need to have internet connection initially to download the models.

> > The wespeaker model is for Speaker Embedding which is used in Diarization task. For alignment wav2vec2 is used in whisperx.
>
> All the wespeaker models are pretrained in English or Chinese. Do you know whether these will work with other languages?

I doubt it'll work with good accuracy for other languages. I found this for Swedish - https://spraakbanken.gu.se/en/resources/embeddings. I'll share if I find anything useful.
