
speechbox's Introduction


🚨🚨 Important: This package is not actively maintained. If you are interested in maintaining this repo, please open an issue so that we can contact you 🚨🚨

🤗 Speechbox offers a set of speech processing tools, such as punctuation restoration.

Installation

With pip (official package)

pip install speechbox

Contributing

We โค๏ธ contributions from the open-source community! If you want to contribute to this library, please check out our Contribution guide. You can look out for issues you'd like to tackle to contribute to the library.

Also, say 👋 in our public Discord channel (Join us on Discord) under ML for Audio and Speech. We discuss new trends in machine learning methods for speech, help each other with contributions and personal projects, or just hang out over ☕.

Tasks

Task | Description | Author
Punctuation Restoration | Predict capitalization as well as punctuation from audio using Whisper. | Patrick von Platen
ASR With Speaker Diarization | Transcribe long audio files, such as meeting recordings, with speaker information (who spoke when) alongside the transcribed text. | Sanchit Gandhi

Punctuation Restoration

Punctuation restoration relies on the premise that Whisper can understand universal speech. The model is forced to predict the passed words, but is allowed to capitalize letters, remove or add blank spaces, and add punctuation. Punctuation is defined as the official Python string.punctuation characters.
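
For reference, the punctuation set this refers to is Python's built-in string.punctuation constant:

import string

# The full set of characters the restorer is allowed to insert
print(string.punctuation)
# => !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~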

Note: For now, this package has only been tested with a handful of Whisper checkpoints, and only on some 80 audio samples of patrickvonplaten/librispeech_asr_dummy.

See some transcribed results here.

Web Demo

If you want to try out punctuation restoration, you can use the following 🚀 Space:

Hugging Face Spaces

Example

In order to use the punctuation restoration task, you need to install Transformers:

pip install --upgrade transformers

For this example, we will additionally make use of datasets to load a sample audio file:

pip install --upgrade datasets

Now we stream a single audio sample, load the punctuation restoring class with "openai/whisper-tiny.en" and add punctuation to the transcription.

from speechbox import PunctuationRestorer
from datasets import load_dataset

streamed_dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# get first sample
sample = next(iter(streamed_dataset))

# print out normalized transcript
print(sample["text"])
# => "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"

# load the restoring class
restorer = PunctuationRestorer.from_pretrained("openai/whisper-tiny.en")
restorer.to("cuda")

restored_text, log_probs = restorer(sample["audio"]["array"], sample["text"], sampling_rate=sample["audio"]["sampling_rate"], num_beams=1)

print("Restored text:\n", restored_text)

See examples/restore for more information.

ASR With Speaker Diarization

Given an unlabelled audio segment, a speaker diarization model is used to predict "who spoke when". These speaker predictions are paired with the output of a speech recognition system (e.g. Whisper) to give speaker-labelled transcriptions.

The combined ASR + Diarization pipeline can be applied directly to long audio samples, such as meeting recordings, to give fully annotated meeting transcriptions.

Web Demo

If you want to try out the ASR + Diarization pipeline, you can use the following Space:

Hugging Face Spaces

Example

In order to use the ASR + Diarization pipeline, you need to install 🤗 Transformers and pyannote.audio:

pip install --upgrade transformers pyannote.audio

For this example, we will additionally make use of 🤗 Datasets to load a sample audio file:

pip install --upgrade datasets

Now we stream a single audio sample, pass it to the ASR + Diarization pipeline, and return the speaker-segmented transcription:

import torch
from speechbox import ASRDiarizationPipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device=device)

# load dataset of concatenated LibriSpeech samples
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
# get first sample
sample = next(iter(concatenated_librispeech))

out = pipeline(sample["audio"])
print(out)
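
Each entry in the output is a speaker-labelled segment. As a minimal sketch (assuming each segment carries "speaker", "text" and "timestamp" keys, matching the example outputs shown in the issues below), you could format the result into a readable transcript:

def format_as_transcript(segments):
    # Each segment is assumed to look like:
    # {"speaker": "SPEAKER_00", "text": "...", "timestamp": (start, end)}
    # and both start and end timestamps are assumed to be present.
    lines = []
    for segment in segments:
        start, end = segment["timestamp"]
        lines.append(f"{start:.1f}s -> {end:.1f}s {segment['speaker']}: {segment['text'].strip()}")
    return "\n".join(lines)

print(format_as_transcript(out))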

speechbox's People

Contributors

bofenghuang, fredhaa, hbredin, jbjoyce, patrickvonplaten, sanchit-gandhi, utility-aagrawal


speechbox's Issues

An idea for enhancing punctuation restoration for non-space separated languages

Hello! Thank you for creating such a nice project. I was pleasantly surprised to see that you also utilized Whisper for punctuation restoration, as I had the same idea and my implementation is available here: https://github.com/jumon/whisper-punctuator. In fact, your implementation looks nicer than mine 😆

I have one small suggestion for improvement. I noticed that your code splits words by spaces and only inserts punctuation marks after words. This approach may not work well for languages such as Japanese and Chinese, which do not use spaces to indicate word boundaries. It may be beneficial to allow punctuation insertion at other locations or to make it an optional feature.
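
A minimal sketch of the suggestion, assuming the candidate insertion points are made configurable (the helper below is hypothetical, not part of speechbox): split on spaces for space-separated languages, and on individual characters otherwise.

def candidate_units(text, split_on_spaces=True):
    # Word-level insertion points for space-separated languages,
    # character-level insertion points for e.g. Japanese or Chinese
    return text.split() if split_on_spaces else list(text)

print(candidate_units("he was in a fevered state"))
# ['he', 'was', 'in', 'a', 'fevered', 'state']
print(candidate_units("これはテストです", split_on_spaces=False))
# ['こ', 'れ', 'は', 'テ', 'ス', 'ト', 'で', 'す']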

Thank you!

TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'

Hi,

I am getting the following error:
(screenshot of the traceback omitted)

This happens when I run the ASR + Diarization pipeline on the following video:
https://drive.google.com/file/d/1wBwAs5Ubp-70fAf7uwevEu8XFlqksQja/view?usp=sharing

It's happening because the end timestamp of the last segment is None. I found this - huggingface/transformers#23231.

For now, I was able to suppress the error by removing None from end_timestamps array. I understand that I lose some text towards the end of the video but for my use-case, that's not a problem. I wanted to bring this to your attention for a fix in the future. Let me know if you need anything additional. Thanks!
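
A minimal sketch of the workaround described above (the chunk structure is an assumption based on the Transformers ASR pipeline output; variable names are hypothetical):

# Hypothetical ASR output where the final chunk has no end timestamp
asr_chunks = [
    {"text": "hello, thanks for calling", "timestamp": (0.0, 5.2)},
    {"text": "have a nice day", "timestamp": (110.3, None)},
]

# Drop chunks whose end timestamp is None before pairing them with the
# diarization segments; some text at the very end of the audio is lost.
asr_chunks = [chunk for chunk in asr_chunks if chunk["timestamp"][1] is not None]
print(asr_chunks)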

Error with: ASR With Speaker Diarization Example

Running into library version issues on Google Colab.

Reproduce

  • Open Google Colab (with a GPU instance)
  • Run:

!pip install --upgrade transformers pyannote.audio
!pip install speechbox
!pip install --upgrade datasets


import torch
from speechbox import ASRDiarizationPipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device=device)

# load dataset of concatenated LibriSpeech samples
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
# get first sample
sample = next(iter(concatenated_librispeech))

out = pipeline(sample["audio"])
print(out)

Error


OSError                                   Traceback (most recent call last)
<ipython-input-2-28ff1705801a> in <cell line: 2>()
      1 import torch
----> 2 from speechbox import ASRDiarizationPipeline
      3 from datasets import load_dataset
      4 
      5 device = "cuda:0" if torch.cuda.is_available() else "cpu"

17 frames
/usr/lib/python3.9/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    372 
    373         if handle is None:
--> 374             self._handle = _dlopen(self._name, mode)
    375         else:
    376             self._handle = handle

OSError: /usr/local/lib/python3.9/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN2at4_ops10select_int4callERKNS_6TensorElN3c106SymIntE

Add support for specifying the number of speakers in ASRDiarizationPipeline

Hi @speechbox developers,

I've been using the ASRDiarizationPipeline and noticed that there isn't a built-in option to specify the number of speakers when performing diarization. This feature would be very helpful for scenarios where the number of speakers is already known or can be estimated beforehand, as it can potentially improve the performance of the speaker diarization process.
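
Until such an option exists, one sketch that may already work: the pipeline appears to forward extra keyword arguments from its call to the underlying pyannote diarization pipeline (see the __call__ excerpt in a traceback further down this page), and pyannote's pipeline accepts num_speakers / min_speakers / max_speakers. This is an assumption, not a documented speechbox feature; the audio file name is a placeholder.

from speechbox import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device="cuda:0")

# Assumption: extra keyword arguments are forwarded to the pyannote
# diarization pipeline, which supports constraining the speaker count.
out = pipeline("meeting.wav", num_speakers=2)
print(out)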

got an unexpected keyword argument 'use_auth_token'


TypeError Traceback (most recent call last)
in <cell line: 6>()
4
5 device = "cuda:0" if torch.cuda.is_available() else "cpu"
----> 6 pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device=device,token='')
7
8 # load dataset of concatenated LibriSpeech samples

2 frames
/usr/local/lib/python3.10/dist-packages/transformers/pipelines/automatic_speech_recognition.py in __init__(self, model, feature_extractor, tokenizer, decoder, modelcard, framework, task, args_parser, device, torch_dtype, binary_output, **kwargs)
286 self.type = "ctc"
287
--> 288 self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
289
290 mapping = MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES.copy()

TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'use_auth_token'
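
One possible workaround until this is fixed, based on the manual-construction pattern shown elsewhere on this page: build the Transformers ASR pipeline and the pyannote diarization pipeline yourself, so speechbox does not forward the deprecated use_auth_token keyword to Transformers, and wrap them in ASRDiarizationPipeline directly. This is a sketch, not an official fix; the token string is a placeholder.

from transformers import pipeline
from pyannote.audio import Pipeline
from speechbox import ASRDiarizationPipeline

# Build both pipelines manually so no auth keyword is passed through to
# the Transformers ASR pipeline constructor.
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device="cuda:0")
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # placeholder token
)

pipe = ASRDiarizationPipeline(asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline)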

Anyone have demo source code to process file with whisper large model and get outputs as vtt srt?

Dear @patrickvonplaten, thank you very much for this repo. I am using it almost every day to generate subtitles for my videos with Whisper.

However, I need the produced VTT file (basically the subtitle output format of the transcription).

Currently, to fix and improve punctuation, I am using fullstop-punctuation-multilang-large (https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large), but I can't say it is the best.

I would like to test your repo; however, I need a demo for full VTT export.

Could you release demo source code that can output a VTT file? It could include both the fixed and the raw output of Whisper for comparison.

Moreover, I have added you on LinkedIn; I'd appreciate it if you accept: https://www.linkedin.com/in/furkangozukara/

One final thing: I am also very interested in Stable Diffusion and preparing tutorial videos. I hope that you consider adding my tutorial videos to the readme here: https://huggingface.co/runwayml/stable-diffusion-v1-5

And perhaps reopen this topic so people can learn? Thank you: https://huggingface.co/runwayml/stable-diffusion-v1-5/discussions/66

'GenerationConfig' object has no attribute 'no_timestamps_token_id'

When using a fine-tuned Whisper model, running ASRDiarizationPipeline throws an error:

from speechbox import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline.from_pretrained("pgilles/whisper-large-v2-lb_cased_03", device=device, use_auth_token="***")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-17b62c26ae1a> in <module>
----> 1 out = pipeline(audio_file, group_by_speaker=True)
      2 pd.DataFrame(out)

8 frames
/usr/local/lib/python3.8/dist-packages/transformers/generation/logits_process.py in __init__(self, generate_config)
    934     def __init__(self, generate_config):  # support for the kwargs
    935         self.eos_token_id = generate_config.eos_token_id
--> 936         self.no_timestamps_token_id = generate_config.no_timestamps_token_id
    937         self.timestamp_begin = generate_config.no_timestamps_token_id + 1
    938 

AttributeError: 'GenerationConfig' object has no attribute 'no_timestamps_token_id'
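
One possible workaround, assuming the fine-tuned checkpoint simply lacks Whisper's timestamp-related fields in its generation config: copy the generation config from the base checkpoint it was fine-tuned from onto the wrapped ASR model. The base checkpoint name and the pipeline.asr_pipeline attribute path below are assumptions, not documented API.

from transformers import GenerationConfig

# Assumption: the fine-tuned model is based on openai/whisper-large-v2 and the
# wrapped ASR pipeline is exposed as `pipeline.asr_pipeline`.
base_generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")
pipeline.asr_pipeline.model.generation_config = base_generation_config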

Error with speaker diarization and transcription

ValueError: attempt to get argmin of an empty sequence

code used:

from transformers import pipeline
from speechbox import ASRDiarizationPipeline

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
    device=0,
)

pipeline = ASRDiarizationPipeline(
    asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline
)

Inverse text normalization of numbers etc.

The restorer in speechbox is very good at restoring punctuation and capitalization in sentences without spelled out numbers.

However, for CTC-based ASR approaches like Wav2Vec2, numbers are spelled out, as the vocabulary usually only contains alphabetical characters. In order to use the restorer on the output of a Wav2Vec2 model, one would therefore first need to do inverse normalization of numerals, using e.g. language-specific FSTs.

I wonder whether it is possible to use the output of Whisper (which is already alphanumerical) within this method to simultaneously do inverse normalization as well as punctuation and capitalization, thereby creating a universal post-processor for CTC models, which supports all languages included in Whisper.

Here are some examples (in Danish), using the Whisper-medium multilingual model:

An audio clip saying the words:
seksogtres tusind fem hundrede og enoghalvfjerds (66.571)
produces the following output with the restorer:
SEKSOGTRES TUSIND' FEM HUNDREDE OG enoghalvfjerds

otteogtres tusind tre hundrede og toogtyve (68.322)
becomes:
Otteogtres-tusind-trehundredeog-toogtyve

How feasible is it to have ITN as a first step in the method?

Loading a custom audio sample into the diarization pipeline

Hey! First of all, thanks for all the amazing work.

I am trying to get the diarization to work with custom audio samples (i.e. audio.mp3 or audio.wav files), and I would like to know how to load them before calling the pipeline.

In particular, I'd like to substitute this sample with my own files:

concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
sample = next(iter(concatenated_librispeech))

Sorry about my ignorance, I'm very used to NodeJS and finding it challenging to follow everything!
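
For what it's worth, other snippets on this page pass a local file path straight to the pipeline (e.g. pipe("audio.mp3")), and the README example passes a dict with "array" and "sampling_rate" keys, so either route should work. A minimal sketch (file names are placeholders):

import librosa
from speechbox import ASRDiarizationPipeline

pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny")

# Option 1: pass the path of a local audio file directly
out = pipeline("audio.mp3")

# Option 2: load the waveform yourself and pass it as array + sampling rate,
# mirroring the datasets-style audio dict used in the README example
waveform, sampling_rate = librosa.load("audio.wav", sr=16_000)
out = pipeline({"array": waveform, "sampling_rate": sampling_rate})
print(out)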

AttributeError: 'Annotation' object has no attribute 'for_json'

Tested versions

  • Python 3.12.2
  • Pyannote.audio 3.1.1
  • Pyannote.core 5.0.0
  • speechbox 0.2.1

System information

macOS 14.1 (23B2073) - M3 Max

Issue description

Code

import os

from transformers import pipeline
from pyannote.audio import Pipeline
from speechbox import ASRDiarizationPipeline as ASRDP

diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=os.environ["HUGGINGFACE_TOKEN"])
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
pipe = ASRDP(asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline)
output = pipe("audio.mp3")

Error

/opt/homebrew/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
/opt/homebrew/lib/python3.12/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 2
      1 with ProgressHook() as hook:
----> 2     output = pipe("audio.mp3", hook=hook)

File /opt/homebrew/lib/python3.12/site-packages/speechbox/diarize.py:90, in ASRDiarizationPipeline.__call__(self, inputs, group_by_speaker, **kwargs)
     83 inputs, diarizer_inputs = self.preprocess(inputs)
     85 diarization = self.diarization_pipeline(
     86     {"waveform": diarizer_inputs, "sample_rate": self.sampling_rate},
     87     **kwargs,
     88 )
---> 90 segments = diarization.for_json()["content"]
     92 # diarizer output may contain consecutive segments from the same speaker (e.g. {(0 -> 1, speaker_1), (1 -> 1.5, speaker_1), ...})
     93 # we combine these segments to give overall timestamps for each speaker's turn (e.g. {(0 -> 1.5, speaker_1), ...})
     94 new_segments = []

AttributeError: 'Annotation' object has no attribute 'for_json'

This was tried in a Jupyter Notebook on a local device as well as on Google Colab. The error remains the same.

Minimal reproduction example (MRE)

AttributeError: 'Annotation' object has no attribute 'for_json'

ASR Diarization Performance

This diarization doesn't compare favorably with Whisper. Wondering if I'm missing something in the call parameters or elsewhere.

On this two-minute video - https://www.youtube.com/watch?v=xbyEs7DJshw&ab_channel=HipronarySchool%23Callcenter - speechbox segments it into 14 speaker transitions as below (vs. something between 37 and 45). The code used is pretty trivial but is attached for context (sharpenspeechbrain.py.zip).

   speaker     text                                               timestamp
0  SPEAKER_00  Hello, can you take a picture for a spot in h...  (0.0, 13.2)
1  SPEAKER_01  How don't I have any?                              (13.2, 14.7)
2  SPEAKER_00  Yes, it's ATK 0804949. Okay, just let me veri...  (14.7, 53.0)
3  SPEAKER_01  You did? For the ninth time, the only is not ...  (53.0, 57.0)
4  SPEAKER_00  Okay, sir. I totally understand your situatio...  (57.0, 66.8)
5  SPEAKER_01  Okay, yeah, yeah, we usually then our brother...  (66.8, 71.16)
6  SPEAKER_00  Could you take a look design the boss and ver...  (72.0, 76.8)
7  SPEAKER_01  But they're not a cable in the dogs. Okay, le...  (77.28, 80.72)
8  SPEAKER_00  Sores are my mistake current in system, the e...  (82.56, 101.76)
9  SPEAKER_01  Okay, but if you have a lot of trouble going ...  (101.76, 107.0)
10 SPEAKER_00  If you want anything already wrong from us, w...  (107.0, 111.0)
11 SPEAKER_01  No, I don't need different deals, guys. Thank...  (111.0, 114.0)
12 SPEAKER_00  For doing your doctor, I will just cut your f...  (114.0, 117.0)
13 SPEAKER_01  Yeah, yeah, I'll write right now.                  (117.0, 119.0)
14 SPEAKER_00  How night is?                                      (119.0, 120.0)

OSError: automatic-speech-recognition is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

Tested versions

  • Python 3.12.2
  • Pyannote.audio 3.1.1
  • Pyannote.core 5.0.0
  • speechbox 0.2.1

System information

macOS 14.1 (23B2073) - M3 Max

Issue description

Code

from transformers import pipeline
from pyannote.audio import Pipeline
from speechbox import ASRDiarizationPipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook


pipe = ASRDiarizationPipeline.from_pretrained(
    asr_model="automatic-speech-recognition",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token="<token>",
)

with ProgressHook() as hook:
    output = pipe("audio.mp3", hook=hook)
HTTPError                                 Traceback (most recent call last)
File /opt/homebrew/lib/python3.12/site-packages/huggingface_hub/utils/_errors.py:304, in hf_raise_for_status(response, endpoint_name)
    303 try:
--> 304     response.raise_for_status()
    305 except HTTPError as e:

File /opt/homebrew/lib/python3.12/site-packages/requests/models.py:1021, in Response.raise_for_status(self)
   1020 if http_error_msg:
-> 1021     raise HTTPError(http_error_msg, response=self)

HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/automatic-speech-recognition/resolve/main/config.json

The above exception was the direct cause of the following exception:

RepositoryNotFoundError                   Traceback (most recent call last)
File /opt/homebrew/lib/python3.12/site-packages/transformers/utils/hub.py:398, in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_gated_repo, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
    396 try:
    397     # Load from URL or cache if already cached
--> 398     resolved_file = hf_hub_download(
    399         path_or_repo_id,
    400         filename,
    401         subfolder=None if len(subfolder) == 0 else subfolder,
    402         repo_type=repo_type,
    403         revision=revision,
...
    431         f"'https://huggingface.co/{path_or_repo_id}' for available revisions."
    432     ) from e

OSError: automatic-speech-recognition is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

How to solve this error

Instead of an actual ASR model name, "automatic-speech-recognition" was passed as asr_model; that string is the task name used inside the Transformers pipeline, not a model identifier. Passing a real model name makes it work.
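
Concretely, mirroring the working calls shown elsewhere on this page, the fix is to pass an actual checkpoint as asr_model:

from speechbox import ASRDiarizationPipeline

# Pass a real ASR checkpoint (e.g. a Whisper model), not the task name
pipe = ASRDiarizationPipeline.from_pretrained(
    asr_model="openai/whisper-base",
    diarizer_model="pyannote/speaker-diarization-3.1",
    use_auth_token="<token>",
)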

Minimal reproduction example (MRE)

automatic-speech-recognition is not a local folder and is not a valid model

Restore punctuation for audio that is not 16 kHz

Hi @patrickvonplaten 👋,

Thanks for this project!

I'm thinking we should add optional audio resampling, since WhisperFeatureExtractor doesn't do it internally.

Below is an updated example. But it might be better to have it inside PunctuationRestorer to make it an out-of-the-box solution. What's your opinion? Willing to make a PR if necessary :)

import string
import re
from datasets import load_dataset
import librosa
from speechbox import PunctuationRestorer

streamed_dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation", streaming=True)

# get first sample
sample = next(iter(streamed_dataset))

# print out normalized transcript
print(sample["sentence"])
# => "It is from Westport, above the villages of Murrisk and Lecanvey."
sentence = re.sub(rf"[{re.escape(string.punctuation)}]", "", sample["sentence"]).lower()
print(sentence)
# => "it is from westport above the villages of murrisk and lecanvey"

# load the restoring class
restorer = PunctuationRestorer.from_pretrained("openai/whisper-tiny.en")
restorer.to("cuda")

# resample audio if necessary
model_sample_rate = restorer.processor.feature_extractor.sampling_rate
if sample["audio"]["sampling_rate"] != model_sample_rate:
    sample["audio"]["array"] = librosa.resample(
        sample["audio"]["array"], orig_sr=sample["audio"]["sampling_rate"], target_sr=model_sample_rate, res_type="kaiser_best"
    )

restored_text, log_probs = restorer(sample["audio"]["array"], sentence, sampling_rate=model_sample_rate, num_beams=1)

print("Restored text:\n", restored_text)
# Restored text:
# It is from Westport above the villages of MURRISK and LECANVEY.

TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'use_auth_token'

from speechbox import ASRDiarizationPipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

pipe = ASRDiarizationPipeline.from_pretrained(asr_model="openai/whisper-base", diarizer_model="pyannote/speaker-diarization-3.1")

with ProgressHook() as hook:
    output = pipe("audio.mp3", hook=hook)

Error

/opt/homebrew/lib/python3.12/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/Users/user/asr/main.py", line 21, in <module>
    pipe = ASRDiarizationPipeline.from_pretrained(asr_model="openai/whisper-base", diarizer_model="pyannote/speaker-diarization-3.1")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/speechbox/diarize.py", line 33, in from_pretrained
    asr_pipeline = pipeline(
                   ^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/pipelines/__init__.py", line 1107, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 220, in __init__
    super().__init__(model, tokenizer, feature_extractor, device=device, torch_dtype=torch_dtype, **kwargs)
  File "/opt/homebrew/lib/python3.12/site-packages/transformers/pipelines/base.py", line 886, in __init__
    self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
                                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'use_auth_token'

Text-only approach to punctuation restoration Pro/Con

I know this project is at an early stage, but I just want to flag an alternative approach to punctuation restoration. It's a package called punctfix, and it can be found here (I'm not a contributor to that package). Rather than using Whisper models, it uses a NER-style approach, and it works really well and is super fast.

ASRDiarizationPipeline processing time

Hey y'all!

Been doing some tests of the diarization pipeline, and unfortunately I'm getting very slow processing times.

Granted, I'm testing it from my Mac M1, but when I transcribe stuff with the "barebones" Whisper I do get faster processing times.

As of right now, I am using an API key generated from a free Hugging Face account.

Is this slow performance expected? Is there a way to check any potential performance bottlenecks?

Thank you!

Language Selection is Not Available for Whisper Model

Code

import json
from speechbox import ASRDiarizationPipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

pipe = ASRDiarizationPipeline.from_pretrained(asr_model="openai/whisper-base", diarizer_model="pyannote/speaker-diarization-3.1")

with ProgressHook() as hook:
    output = pipe("audio.mp3", hook=hook)

json.dump(output, open("output.json", "w"))

Output

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.

Question

Where do I specify generate_kwargs = {"language":"Hindi"}?
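
One way that may work, based on the manual-construction pattern used elsewhere on this page: build the Transformers ASR pipeline yourself with generate_kwargs and wrap it together with the pyannote pipeline. Treat this as a sketch rather than a documented speechbox option; the token is a placeholder.

from transformers import pipeline
from pyannote.audio import Pipeline
from speechbox import ASRDiarizationPipeline

# Fix the transcription language (and task) on the underlying Whisper pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # placeholder token
)

pipe = ASRDiarizationPipeline(asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline)
output = pipe("audio.mp3")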

Unwanted automatic translation of non-English input to diarization

Currently, when using a non-English audio file, speechbox automatically translates the transcription to English.
Whisper has this feature, controlled by the task argument ('transcribe' vs. 'translate'). I was unable to forward this option to the Whisper ASR model, as the keyword task is already used for 'automatic-speech-recognition'.
Whisper by itself is fully capable of transcribing German input audio into German - however, not speaker-diarized.
Is there a way to get around this?

[New Task] Add timestamp alignment

It would be very nice to have a simple tool to align timestamps and audio, something along the lines of:

from speechbox import SpeechAligner

aligner = SpeechAligner.from_pretrained(...)

aligner.align(audio=audio, transcript=transcript)
