When transcribing a 3min audio with basic parameters and no stem, the resulting .srt f


Audio only partially transcribed, and a different part each time? (about mahmoudashraf97/whisper-diarization)

Comments (3)

transcriptionstream commented on May 27, 2024

Any other detail? Version of python in use? Errors?

from whisper-diarization.

Psarpei commented on May 27, 2024

Hey @transcriptionstream, thanks for your reply!

I'm on Python 3.10 and I don't get any errors.

60
00:05:28,056 --> 00:05:32,820
Speaker 1: Right now we spend the same amount of compute on each token, a dumb one, or like figuring out some complicated math.

61
00:05:32,820 --> 00:05:33,700
Speaker 1: !

62
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Up to cue 60 everything is transcribed accurately, but after that a lot of spoken text is missing. The transcript jumps straight to the part of the audio shown in cue 62, so everything in between is skipped.

When I repeat the run, the length of the skipped audio differs:

47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Now the skipped part is much longer, but the last sentence is still there.
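A gap like this can be located automatically instead of by reading the .srt by hand. The following is a minimal sketch, not part of whisper-diarization: the helper names (`parse_srt`, `find_suspect_cues`) and the `max_seconds_per_char` threshold of 0.5 are assumptions made for illustration. It flags cues whose text has no letters or digits (such as `!`) or whose duration is implausibly long for the amount of text, assuming the `Speaker N:` prefix format shown in the excerpts above.

```python
import re

# Matches an SRT time range such as "00:05:02,519 --> 00:05:33,700".
TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert the four timestamp fields to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Yield (cue_number, start_ms, end_ms, body) for each well-formed cue."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3 or not lines[0].isdigit():
            continue  # skip anything that is not number / time range / text
        match = TIME_RE.fullmatch(lines[1])
        if match is None:
            continue
        g = match.groups()
        yield int(lines[0]), to_ms(*g[:4]), to_ms(*g[4:]), " ".join(lines[2:])


def find_suspect_cues(text, max_seconds_per_char=0.5):
    """Return (cue_number, duration_seconds, spoken_text) for suspicious cues.

    The threshold is a hypothetical default chosen for this sketch, not a
    value taken from whisper-diarization.
    """
    suspects = []
    for number, start_ms, end_ms, body in parse_srt(text):
        # Drop the "Speaker N:" prefix so it does not count as spoken text.
        spoken = re.sub(r"^Speaker \d+:\s*", "", body)
        chars = sum(ch.isalnum() for ch in spoken)
        duration = (end_ms - start_ms) / 1000
        if chars == 0 or duration / chars > max_seconds_per_char:
            suspects.append((number, duration, spoken))
    return suspects


# Excerpt from the second run quoted in this comment.
SAMPLE = """47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.
"""

for number, duration, spoken in find_suspect_cues(SAMPLE):
    print(f"cue {number}: {duration:.3f}s for text {spoken!r}")
```

On the excerpt above this flags only cue 48, which covers about 31 seconds of audio with no transcribed words. Running the same check on the output of several runs would show how the length of the skipped region changes between them.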


Psarpei commented on May 27, 2024

That's the full log:

python diarize.py -a /home/pascal/code/video_translator/data/sent_lvl_sd/bgates_saltmann2/audio_file_enh.wav --whisper-model large-v3 --suppress_numerals --device cuda --language en
/home/pascal/anaconda3/envs/whisper_diar_inf/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[NeMo W 2024-03-27 17:19:05 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
[NeMo I 2024-03-27 17:20:14 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-03-27 17:20:14 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:14 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-03-27 17:20:14 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-03-27 17:20:15 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-03-27 17:20:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
  shift:
    prob: 0.5
    min_shift_ms: -10.0
    max_shift_ms: 10.0
  white_noise:
    prob: 0.5
    min_level: -90
    max_level: -46
    norm: true
  noise:
    prob: 0.5
    manifest_path: /manifests/noise_0_1_musan_fs.json
    min_snr_db: 0
    max_snr_db: 30
    max_gain_db: 300.0
    norm: true
  gain:
    prob: 0.5
    min_gain_dbfs: -10.0
    max_gain_dbfs: 10.0
    norm: true
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-03-27 17:20:15 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false
}
[NeMo I 2024-03-27 17:20:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s]
[NeMo I 2024-03-27 17:20:16 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-03-27 17:20:16 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:16 collections:446] Dataset loaded with 8 items, total duration of 0.10 hours.
[NeMo I 2024-03-27 17:20:16 collections:448] # 8 files loaded accounting to # 1 labels
vad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.35it/s]
[NeMo I 2024-03-27 17:20:17 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2024-03-27 17:20:18 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6.57it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 343 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 343 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10.06it/s]
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 420 items, total duration of 0.13 hours.
[NeMo I 2024-03-27 17:20:19 collections:448] # 420 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.25it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:20 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:20 collections:446] Dataset loaded with 535 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:20 collections:448] # 535 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 13.27it/s]
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 722 items, total duration of 0.14 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 722 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 15.12it/s]
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 1106 items, total duration of 0.15 hours.
[NeMo I 2024-03-27 17:20:21 collections:448] # 1106 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 18.48it/s]
[NeMo I 2024-03-27 17:20:22 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.52it/s]
[NeMo I 2024-03-27 17:20:23 clustering_diarizer:464] Outputs are saved in /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs directory
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:0 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:1 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:2 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:3 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:4 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-03-27 17:20:23 msdd_models:938] Loading cluster label file from /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-03-27 17:20:23 collections:761] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-03-27 17:20:23 collections:764] Total 1 session files loaded accounting to # 1 audio clips
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.66it/s]
[NeMo I 2024-03-27 17:20:23 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-03-27 17:20:23 msdd_models:1431]
