For example, with a silent segment from 00:01:00 to 00:01:30, after being processed by

Enabling VAD results in subtitle timing issue (whisper-ctranslate2, closed)

softcatala commented on May 15, 2024

Enabling VAD results in subtitle timing issue

Comments (3)

jordimas commented on May 15, 2024

whisper-ctranslate2 version 0.1.8 or higher exposes the following VAD filter parameters:

  --vad_filter VAD_FILTER
                        Enable the voice activity detection (VAD) to filter out parts of the audio without speech. This step is using the Silero VAD model
                        https://github.com/snakers4/silero-vad. (default: False)
  --vad_threshold VAD_THRESHOLD
                        When `vad_filter` is enabled, probabilities above this value are considered as speech. (default: None)
  --vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS
                        When `vad_filter` is enabled, final speech chunks shorter min_speech_duration_ms are thrown out. (default: None)
  --vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S
                        When `vad_filter` is enabled, Maximum duration of speech chunks in seconds. Longer will be split at the timestamp of the last silence. (default: None)
  --vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS
                        When `vad_filter` is enabled, in the end of each speech chunk time to wait before separating it. (default: None)

You can experiment with these parameters to find values that suit your audio file. A more detailed description of what each parameter means is here: https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/vad.py#L28 You probably want to start with the vad_threshold parameter.
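For anyone who wants to try several combinations, here is a minimal sketch of a helper that assembles the command line from the flags listed above. The helper name `build_vad_command`, the file name `talk.mp3`, and the values shown are only illustrative assumptions, not recommended settings.

```python
import shlex


def build_vad_command(audio_path, threshold=None, min_speech_ms=None,
                      max_speech_s=None, min_silence_ms=None):
    """Build a whisper-ctranslate2 argument list with the VAD filter enabled.

    Hypothetical helper: only the flags shown in the help output above are
    used, and any option left as None is omitted so the tool's default applies.
    """
    cmd = ["whisper-ctranslate2", audio_path, "--vad_filter", "True"]
    optional = {
        "--vad_threshold": threshold,
        "--vad_min_speech_duration_ms": min_speech_ms,
        "--vad_max_speech_duration_s": max_speech_s,
        "--vad_min_silence_duration_ms": min_silence_ms,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Illustrative values only; tune them against your own audio.
cmd = build_vad_command("talk.mp3", threshold=0.4, min_silence_ms=1000)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` to run one combination at a time and compare the subtitles it produces.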

old9 commented on May 15, 2024

Thanks. Tuning some of these parameters does improve the issue, but it also introduces other problems, such as occasionally missing sentences. I'm still trying to figure out the right combination.

mayeaux commented on May 15, 2024

It's a known issue and a pretty bad one.

SYSTRAN/faster-whisper#125

The issue is that Whisper timestamps can be slightly inaccurate. If a timestamp starts a little early in the VAD-processed audio, then once the time splits are mapped back onto the existing transcription, the subtitle can appear to the post-processing to start before the silence. I'm not sure what the best approach is. Perhaps transcribing each segment separately and then re-attaching them would work, but in my testing this leads to some issues of its own. I know guillaumekln is working on it, so we'll see what he comes up with. I am a little mystified myself as to what the solution is.
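To make the effect concrete, here is a rough sketch of the mapping, using the silent segment from 00:01:00 to 00:01:30 mentioned in the report. It is not faster-whisper's actual code: `restore_time` and the chunk list are hypothetical, and it assumes the speech chunks kept by VAD are simply concatenated before transcription.

```python
def restore_time(t, speech_chunks):
    """Map a time (seconds) on the silence-free timeline back to the original.

    speech_chunks is a sorted list of (start, end) pairs, in original seconds,
    for the parts of the audio that VAD kept.
    """
    elapsed = 0.0  # duration of kept speech before the current chunk
    for start, end in speech_chunks:
        duration = end - start
        if t < elapsed + duration:
            return start + (t - elapsed)
        elapsed += duration
    # Past the end of the kept audio: clamp to the end of the last chunk.
    return speech_chunks[-1][1]


# Speech from 0-60 s and 90-120 s; the 60-90 s silence was removed by VAD.
chunks = [(0.0, 60.0), (90.0, 120.0)]

# A sentence that really starts right after the silence sits at 60.0 s on
# the silence-free timeline and maps back correctly.
exact_start = restore_time(60.0, chunks)

# If Whisper reports the start just 0.2 s early, it lands in the previous
# chunk and maps to a point before the silence.
early_start = restore_time(59.8, chunks)
end = restore_time(61.5, chunks)

print(exact_start, early_start, end)
```

With these numbers the subtitle that should run from about 90.0 s to 91.5 s instead runs from 59.8 s to 91.5 s, so it stays on screen for the whole silent segment.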
