
Distil-Whisper

[Paper] [Models] [Colab] [Training Code]

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets:

Model              Params / M   Rel. Latency ↑   Short-Form WER ↓   Long-Form WER ↓
large-v3           1550         1.0              8.4                11.0
distil-large-v3    756          6.3              9.7                10.8
distil-large-v2    756          5.8              10.1               11.6
distil-medium.en   394          6.8              11.1               12.4
distil-small.en    166          5.6              12.1               12.8

For most applications, we recommend the latest distil-large-v3 checkpoint, since it is the most performant distilled checkpoint and is compatible across all Whisper libraries. The only exception is resource-constrained applications with very little memory, such as on-device or mobile applications, where distil-small.en is a great choice, since it is only 166M parameters and performs within 4% WER of Whisper large-v3.

Note: Distil-Whisper is currently only available for English speech recognition. We are working with the community to distill Whisper in other languages. If you are interested in distilling Whisper in your language, check out the provided training code. We will update the repository with multilingual checkpoints when they are ready!

1. Usage

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]

Short-Form Transcription

Short-form transcription is the process of transcribing audio samples that are less than 30 seconds long, which is the maximum receptive field of the Whisper models. This means the entire audio clip can be processed in one go without the need for chunking.

First, we load Distil-Whisper via the convenient AutoModelForSpeechSeq2Seq and AutoProcessor classes.

We load the model in float16 precision and minimise the loading time by passing low_cpu_mem_usage=True. In addition, we ensure that the model is loaded in the safetensors format by passing use_safetensors=True:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

The model and processor can then be passed to the pipeline. Note that if you would like to have more control over the generation process, you can directly make use of the model + processor API, as shown below.

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Next, we load an example short-form audio from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")
print(result["text"])

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline docs. We also provide an end-to-end Google Colab that benchmarks Whisper against Distil-Whisper.

For more control over the generation parameters, use the model + processor API directly:

Ad-hoc generation arguments can be passed to model.generate, including num_beams for beam-search, return_timestamps for segment-level timestamps, and prompt_ids for prompting. See the docstrings for more details.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

input_features = input_features.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}

pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])

print(pred_text)
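
As an illustration, to additionally return segment-level timestamps with this API, one could enable return_timestamps in both generation and decoding. This is a small variation on the snippet above, shown as a sketch rather than an official example:

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": True,  # ask Whisper to predict <|x.xx|> timestamp tokens
}

pred_ids = model.generate(input_features, **gen_kwargs)
# decode_with_timestamps keeps the timestamp annotations in the decoded text
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=True)

print(pred_text)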

Sequential Long-Form

The latest distil-large-v3 checkpoint is specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30 seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.

The sequential long-form algorithm should be used in either of the following scenarios:

  1. Transcription accuracy is the most important factor, and latency is less of a consideration
  2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Section 5 of the Distil-Whisper paper.

We start by loading the model and processor as before:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

The model and processor can then be passed to the pipeline. Note that if you would like to have more control over the generation process, you can directly make use of the model.generate(...) API, as shown below.

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Next, we load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")
print(result["text"])

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

Chunked Long-Form

distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the Distil-Whisper paper).

We can load the model and processor as before:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

To enable chunking, pass the chunk_length_s parameter to the pipeline. For distil-large-v3, a chunk length of 25 seconds is optimal. To activate batching, pass the argument batch_size:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

The argument max_new_tokens controls the maximum number of generated tokens per chunk. In typical speech, there are no more than about 3 words spoken per second, so a 30-second input contains at most roughly 90 words (approximately 128 tokens). We therefore cap the number of generated tokens per chunk at 128 to truncate any hallucinations that occur at the end of a segment. Since these tokens would be removed at the chunk borders by the long-form chunking transcription algorithm anyway, it is more efficient to truncate them directly during generation and avoid redundant decoder steps.

Now, let's load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline docs.

Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.

For speculative decoding, we need to load both the teacher model, openai/whisper-large-v3, and the assistant (a.k.a. student) model, distil-whisper/distil-large-v3.

Let's start by loading the teacher model and processor. We do this in much the same way we loaded the Distil-Whisper model in the previous examples:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Now let's load the assistant. Since Distil-Whisper shares exactly the same encoder as the teacher model, we only need to load its 2-layer decoder as a "decoder-only" model:

from transformers import AutoModelForCausalLM
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

The assistant model shares the same processor as the teacher, so there's no need to load a student processor.

We can now pass the assistant model to the pipeline to be used for speculative decoding. We pass it as a generate_kwarg with the key "assistant_model" so that speculative decoding is enabled:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

As before, we can pass any sample to the pipeline to be transcribed:

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Note: speculative decoding should be on average 2x faster than using Whisper large-v3 alone, at a mere 8% increase in VRAM usage, while mathematically ensuring the same results. This makes it the perfect replacement for Whisper large-v3 in existing speech recognition pipelines.

For more details on speculative decoding, refer to the Distil-Whisper paper.

Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Distil-Whisper which we cover in the following.

Flash Attention

We recommend using Flash Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:

pip install flash-attn --no-build-isolation

You can then pass use_flash_attention_2=True to from_pretrained to enable Flash Attention 2:

- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of BetterTransformer. To do so, you first need to install optimum:

pip install --upgrade optimum

And then convert your model to a "BetterTransformer" model before using it:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
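
Alternatively, on more recent versions of Transformers (v4.36 and later), SDPA attention can be requested directly when loading the model, without going through optimum. A minimal sketch, assuming such a Transformers version is installed:

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # use PyTorch scaled dot-product attention kernels
)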

Exporting to Other Libraries

Distil-Whisper (distil-small.en, distil-medium.en and distil-large-v2) is supported in the following libraries with the original "sequential" long-form transcription algorithm:

  - OpenAI Whisper
  - whisper.cpp
  - Transformers.js
  - Candle (Rust)

Updates will be posted here with the integration of the "chunked" long-form transcription algorithm into the respective libraries.

For the 🤗 Transformers code-examples, refer to the sections Short-Form and Long-Form Transcription.

2. Why use Distil-Whisper? ⁉️

Distil-Whisper is designed to be a drop-in replacement for Whisper on English speech recognition. Here are 5 reasons for making the switch to Distil-Whisper:

  1. Faster inference: 6 times faster inference speed, while performing within 1% WER of Whisper on out-of-distribution audio.

  2. Robustness to noise: demonstrated by strong WER performance at low signal-to-noise ratios.

  3. Robustness to hallucinations: quantified by 1.3 times fewer repeated 5-gram word duplicates (5-Dup.) and a 2.1% lower insertion error rate (IER) than Whisper.

  4. Designed for speculative decoding: Distil-Whisper can be used as an assistant model to Whisper, giving 2 times faster inference speed while mathematically ensuring the same outputs as the Whisper model.
  5. Permissive license: Distil-Whisper is MIT licensed, meaning it can be used for commercial applications.

3. Approach ✍️

To distill Whisper, we copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers of Whisper. All other decoder layers are discarded.
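
The following is a minimal sketch of that initialisation in 🤗 Transformers. It is an illustration of the idea only, not the repository's create_student_model.py script, and it omits details such as copying the decoder embeddings and final layer norm:

import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical to the teacher, but with only 2 decoder layers.
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the entire encoder and freeze it for training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Initialise the 2 student decoder layers from the first and last teacher decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())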

Distil-Whisper is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on pseudo-labelled audio data.
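
A rough sketch of that objective is shown below. The temperature and loss weights are assumptions for illustration; the actual training script handles label padding, weighting and other details differently:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, kl_weight=0.8, ce_weight=1.0):
    # KL divergence between the student and (frozen) teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Cross-entropy of the student against the pseudo-labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return kl_weight * kl + ce_weight * ce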

We train Distil-Whisper on a total of 22k hours of pseudo-labelled audio data, spanning 10 domains with over 18k speakers.

This diverse audio dataset is paramount to ensuring robustness of Distil-Whisper to different datasets and domains.

In addition, we use a WER filter to discard pseudo-labels where Whisper mis-transcribes or hallucinates. This greatly improves WER performance of the downstream distilled model.
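
A sketch of such a filter is shown below, using the jiwer package to compute the WER between the ground-truth transcript and the Whisper pseudo-label; the 10% threshold is an assumption for illustration:

import jiwer

def keep_pseudo_label(reference: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    # Discard training samples where the pseudo-label deviates too much from the reference.
    return jiwer.wer(reference, pseudo_label) <= max_wer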

For full details on the distillation set-up and evaluation results, refer to the Distil-Whisper paper.

4. Training Code

Training code to reproduce Distil-Whisper can be found in the directory training. This code has been adapted to be general enough to distill Whisper for multilingual speech recognition, enabling anyone in the community to distill Whisper on their language of choice.

5. Acknowledgements

6. Citation

If you use this model, please consider citing the Distil-Whisper paper:

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

And also the Whisper paper:

@misc{radford2022robust,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

distil-whisper's People

Contributors

amrrs, bofenghuang, eltociear, eustlb, ibrahimamin1, patrickvonplaten, sanchit-gandhi


distil-whisper's Issues

Running medium.en model on Jetson Xavier

Hi! I was running the Colab code on a Jetson Xavier platform with CUDA 10.8 and a custom-compiled torch 1.8. We can't update JetPack/CUDA right now due to other limitations, but I managed to run the model on the GPU. However, I got these transcriptions:

Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.26k/2.26k [00:00<00:00, 3.27MB/s]
Downloading model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 789M/789M [01:16<00:00, 10.2MB/s]
Downloading (…)neration_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<00:00, 1.12MB/s]
  0%|                                                                                                                                                                    | 0/73 [00:00<?, ?it/s]['AN OKout.. S s카� I s мен reckless gly Hretsor�� enosaurs.']
  1%|██▏                                                                                                                                                         | 1/73 [00:02<03:23,  2.83s/it][' sharing SAN. OKout..el었gram 14ire en mas.']
  3%|████▎                                                                                                                                                       | 2/73 [00:04<02:53,  2.44s/it][' difspakacally yourcompl line I s ad,if describe gче dramaticíp�ak,ict privamin O fra g That douurityys Cabor sщ.']
  4%|██████▍                                                                                                                                                     | 3/73 [00:07<02:46,  2.38s/it][' difist TaiwanSOno toward únicoio Horrorel over S Anaderavort, g G tutorial lroink maybe I drin T And rebell.']
  5%|████████▌                                                                                                                                                   | 4/73 [00:10<02:56,  2.56s/it][' Jetzt chel symbol H a certainly I age تو g Germany nickname, gượcel psychiatric apprentice Hith Yourith aand pairing Barcelona.AN.アk Next SCPel ropes presentedallyies most l s give YeacAN. You beenouöor vida enаст.odyAN.ruct academic love sell en vo.. a treball ہے re sely,�ch exam, very a consequently can l a Ehkl, ver pr,']
  7%|██████████▋                                                                                                                                                 | 5/73 [00:13<03:25,  3.02s/it][' had S neck="osakorerm "ittle accelerated going new sacrifices H,ittle основ l ble.']
  8%|████████████▊                                                                                                                                               | 6/73 [00:16<03:12,  2.87s/it]['� sentlyéd I getting,AN. OKout.. Russiaif helpful elastic HTML.']
 10%|██████████████▉                                                                                                                                             | 7/73 [00:18<02:48,  2.56s/it][' aproximadamentechspak S I a polximoramas gaks l getting S faz unings.']
 11%|█████████████████                                                                                                                                           | 8/73 [00:20<02:42,  2.50s/it][' placeosivingоia, out H Ixt clients, countries gد.']
 12%|███████████████████▏                                                                                                                                        | 9/73 [00:22<02:29,  2.34s/it][' dif that saladys unseen s Dongacist manurn sur баз getting g partlyiqu great symbol,�� aOver Everyoneor s long Jun g sulakac l s gen� I getting, dressed大家都ill s�� Ear here Worldus Police.']
 14%|█████████████████████▏                                                                                                                                     | 10/73 [00:25<02:40,  2.55s/it]['� s wond, g s can Max got accurate obswh Oس re s examples base.']
 15%|███████████████████████▎                                                                                                                                   | 11/73 [00:28<02:32,  2.46s/it]['oundings,ch Sو Aut reANП Michaelos So Leaderac ahow may 안돼 s gift pros I pr, g lidif Liebe meditation l •ät going controversac many怕 I раз']
 16%|█████████████████████████▍                                                                                                                                 | 12/73 [00:30<02:35,  2.54s/it][" 되,idi, enell over prob bitase '."]
 18%|███████████████████████████▌                                                                                                                               | 13/73 [00:33<02:27,  2.47s/it]['AN. OKout..istły en beginning,oschist secretossor onlyки saters engine I recy.']
 19%|█████████████████████████████▋                                                                                                                             | 14/73 [00:35<02:18,  2.35s/it][' con typically OKout.., for.A.']
 21%|███████████████████████████████▊                                                                                                                           | 15/73 [00:37<02:18,  2.38s/it][' currentlyadck desert opt I éléments, s Ukraine Useilanist project a oaby、 a Dark, к partyert B000 forgivefer aations degradation похож.']
 22%|█████████████████████████████████▉                                                                                                                         | 16/73 [00:40<02:23,  2.53s/it][" difistRA gRAos ', Mur 장bero, J then muror conseguirang s 15jin sdoor g theninate s 솔직if most plan."]
 23%|████████████████████████████████████                                                                                                                       | 17/73 [00:42<02:16,  2.43s/it][' Tome organizations a vara Vcess T collaborativeor isies.odyif yourch hasta near g swing s invade哦ith achieith also out then man twelve.']
 25%|██████████████████████████████████████▏                                                                                                                    | 18/73 [00:45<02:26,  2.66s/it][' C maybe imag then man creators,ink�ino surital g esc sley.']
 26%|████████████████████████████████████████▎                                                                                                                  | 19/73 [00:48<02:15,  2.50s/it][' C Direist surely aBy g really / Hcketosad.']
 27%|██████████████████████████████████████████▍                                                                                                                | 20/73 [00:50<02:05,  2.36s/it][' T ostzione��ittle rempeciallyor style tw conf,inkch people So aboutcause.']
 29%|████████████████████████████████████████████▌                                                                                                              | 21/73 [00:52<01:56,  2.24s/it][' T la下or Go really optionsor shoes,inkch people So It.']
 30%|██████████████████████████████████████████████▋                                                                                                            | 22/73 [00:53<01:47,  2.11s/it][' dif copied g 나�fore contributed, contrast somet Dire.']
 32%|████████████████████████████████████████████████▊                                                                                                          | 23/73 [00:56<01:49,  2.20s/it][' Tiritch уri overş��, �ert B000.']
 33%|██████████████████████████████████████████████████▉                                                                                                        | 24/73 [00:58<01:49,  2.23s/it][' dif уri overallyort.']
 34%|█████████████████████████████████████████████████████                                                                                                      | 25/73 [01:00<01:42,  2.14s/it]['oundings,ferel figch G about l newwordityithadyith und Cel knew, hab By Hcause genacro Clarkakor sit yeort nuclear.']
 36%|███████████████████████████████████████████████████████▏                                                                                                   | 26/73 [01:03<01:50,  2.35s/it]['row30,lyien� First culture.']
 37%|█████████████████████████████████████████████████████████▎                                                                                                 | 27/73 [01:05<01:40,  2.18s/it]['謝 Sther optionsting?']
 38%|███████████████████████████████████████████████████████████▍                                                                                               | 28/73 [01:07<01:39,  2.22s/it][' 무서ert B000 l samanке.']
 40%|█████████████████████████████████████████████████████████████▌                                                                                             | 29/73 [01:09<01:32,  2.10s/it]['謝 Sac?']
 41%|███████████████████████████████████████████████████████████████▋                                                                                           | 30/73 [01:10<01:22,  1.93s/it][' Cariке S l s gen기 h Georget, s Make lort undwordity, contrast simple culture.']
 42%|█████████████████████████████████████████████████████████████████▊                                                                                         | 31/73 [01:12<01:24,  2.00s/it][' simple Some Pascal.']
 44%|███████████████████████████████████████████████████████████████████▉                 

etc...

Does anyone have an idea of what's happening? And where should I start looking to fix it?

Thank you soo much :D

Smaller models?

Unless I've missed something, it's not clear whether the same technique works to accelerate the small.en and smaller whisper models. Is that something you've looked at? If not, would there be any mileage in training it up?

small.en in particular is interesting because it's the biggest model that fits onto a Raspberry Pi Zero 2, but it isn't quite fast enough for real-time use. Speeding it up would be transformative.

How to make training data?

I have a folder like this:
audio_1
transcript_1.txt
audio_2
transcript_2.txt

How can I turn this folder into a Hugging Face dataset?
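
One possible way to do this (a sketch only, not from the repository; the file naming and the 16 kHz sampling rate are assumptions based on the folder layout above) is to build a datasets.Dataset from the file paths and transcripts, then cast the audio column:

from pathlib import Path

from datasets import Audio, Dataset

data_dir = Path("my_data")  # hypothetical folder containing audio_*.wav and transcript_*.txt

audio_paths, transcripts = [], []
for transcript_file in sorted(data_dir.glob("transcript_*.txt")):
    idx = transcript_file.stem.split("_")[-1]
    audio_paths.append(str(data_dir / f"audio_{idx}.wav"))
    transcripts.append(transcript_file.read_text().strip())

dataset = Dataset.from_dict({"audio": audio_paths, "text": transcripts})
# Decode and resample the audio column to 16 kHz, the rate Whisper expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Optionally push to the Hub so the training scripts can load it by name:
# dataset.push_to_hub("your-username/my-asr-dataset")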

ggml-distil-small.en.bin slower than ggml-small.en.bin under whisper.cpp

Using the original ggml-small.en.bin on an M1 mac, running whisper.cpp on the hp0.wav sample gives me these timings:

whisper_print_timings:     load time =   469.52 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   144.11 ms
whisper_print_timings:   sample time =  2631.84 ms /  3090 runs (    0.85 ms per run)
whisper_print_timings:   encode time =  7104.00 ms /    13 runs (  546.46 ms per run)
whisper_print_timings:   decode time =   330.89 ms /    25 runs (   13.24 ms per run)
whisper_print_timings:   batchd time =  9753.98 ms /  2982 runs (    3.27 ms per run)
whisper_print_timings:   prompt time =   944.44 ms /  2071 runs (    0.46 ms per run)
whisper_print_timings:    total time = 21415.03 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-small.en.bin -f samples/hp0.wav  7.33s user 0.96s system 38% cpu 21.465 total

Running with ggml-distil-small.en.bin gives the following:

whisper_print_timings:     load time =   333.88 ms
whisper_print_timings:     fallbacks =  13 p /   5 h
whisper_print_timings:      mel time =   155.13 ms
whisper_print_timings:   sample time =  7036.76 ms /  9308 runs (    0.76 ms per run)
whisper_print_timings:   encode time =  6004.30 ms /    11 runs (  545.85 ms per run)
whisper_print_timings:   decode time =   548.74 ms /    95 runs (    5.78 ms per run)
whisper_print_timings:   batchd time = 15798.76 ms /  9098 runs (    1.74 ms per run)
whisper_print_timings:   prompt time =   347.48 ms /  1503 runs (    0.23 ms per run)
whisper_print_timings:    total time = 30275.82 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-distil-small.en.bin -f samples/hp0.wav  17.51s user 1.67s system 63% cpu 30.319 total

Is that expected? It looks like it's faster per pass through the model at every phase, but it needs dramatically more passes.

./main was built here just by calling make, so all config parameters are the defaults.

example using microphone

Let's face it: these models are developed with static datasets, but a primary use case is streaming audio transcription.

Please include a microphone-based demo (or suffer 1000 GitHub issues begging for it; see other Whisper repos).
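
In the meantime, a very minimal push-to-talk style sketch (not true streaming) could record a fixed-length clip with the sounddevice package and pass it to a pipe built as in the usage section above; the 5-second duration is an arbitrary assumption:

import numpy as np
import sounddevice as sd

sampling_rate = 16_000  # Whisper's expected sampling rate
duration_s = 5

print("Recording...")
audio = sd.rec(int(duration_s * sampling_rate), samplerate=sampling_rate, channels=1, dtype="float32")
sd.wait()  # block until the recording has finished

result = pipe({"array": np.squeeze(audio), "sampling_rate": sampling_rate})
print(result["text"])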

small uses more memory (but is faster) than medium (ONNX quantized)

Setup

CUDA 12.2
GTX 1080
Copied all ONNX quantized models and required config jsons to their required location.

Code

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from py3nvml.py3nvml import *
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, pipeline
import torch
import time
import numpy as np

# Initialize NVML
nvmlInit()
# Get the first GPU handle
handle = nvmlDeviceGetHandleByIndex(0)
# Get initial GPU memory usage
info = nvmlDeviceGetMemoryInfo(handle)
initial_gpu_memory = info.used
device = "cuda:0"

# Load models
model_name = '/home/Downloads/asr_models/distil-small-en/onnx_quantized' # Copied all needed files
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=False, use_safetensors=True)
model.to(device)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)
gpu_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    max_new_tokens=128,
    device=0,
)

# Measurements.
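# Note: wav_file (path to a test clip) and wav_secs (its duration in seconds)
# are assumed to be defined elsewhere; they are not shown in this snippet.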
times = []
for _ in range(15):
    start = time.time()
    gpu_pipe(wav_file)
    rtf = (time.time() - start) / (wav_secs)
    times.append(rtf)
times.sort()
times = times[1:-1]
print(np.mean(times))

info = nvmlDeviceGetMemoryInfo(handle)
final_gpu_memory = info.used
gpu_memory_used = (final_gpu_memory - initial_gpu_memory) / 1024 / 1024
print(f"GPU memory used by the code block: {gpu_memory_used} MiB")
nvmlShutdown()

Results

Model             RTF                  Mem Used
distil-medium-en  0.5082902994641076   1878.0 MiB
distil-small-en   0.3903530106270162   2884.0 MiB

Sentence separation

Hey there,

I've noticed that distil-whisper tends to chop sentences in half. It'd be great if it could wrap sentences properly, especially after commas and periods. Switching to word mode and using spaCy sounds a bit like overkill. Any suggestions on how to fix this?

Accelerate

Accelerate is a tool for multi-machine training, but why do you use it on a single GPU?

How to use ONNX model?

Hello there,

I'm interested in using the ONNX model, as I saw that you are providing the weights for it.
I tried to use it with optimum library, but didn't manage to make it work.
Could someone indicate in which direction I should look into?

Thank you so much for this repository and the work you put into it. It really helps!!

Note:

Here is what I tried:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype,  encoder_file_name=f"encoder_model.onnx"
)

Here is the error:

RuntimeError: Too many ONNX model files were found in distil-whisper/distil-large-v2, specify which one to load by using the encoder_file_name argument.

Training without real labels

Is it possible to do the pseudo-labelling without access to already-transcribed audio?

From what I see in the training scripts, the dataset should have a text column, so it's not possible to just use a bunch of audio files to distill a Whisper model.

RuntimeError: Error(s) in loading state_dict for WhisperForConditionalGeneration.

Hi @sanchit-gandhi
I have followed the instructions for training, but during the training step I get the error below. How can I fix it?
Traceback (most recent call last):
File "/home/codelex/Documents/lawly/linear_regression/distil-whisper/training/Distil-whisper-mn/create_student_model.py", line 206, in
init_student_model_from_teacher(
File "/home/codelex/Documents/lawly/linear_regression/distil-whisper/training/Distil-whisper-mn/create_student_model.py", line 134, in init_student_model_from_teacher
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for WhisperForConditionalGeneration.
Missing key(s) in state_dict: ['model.encoder.layers.4.self_attn.k_proj.weight', 'model.encoder.layers.4.self_attn.v_proj.weight', 'model.encoder.layers.4.self_attn.v_proj.bias', 'model.encoder.layers.4.self_attn.q_proj.weight', 'model.encoder.layers.4.self_attn.q_proj.bias', 'model.encoder.layers.4.self_attn.out_proj.weight', 'model.encoder.layers.4.self_attn.out_proj.bias', 'model.encoder.layers.4.self_attn_layer_norm.weight', 'model.encoder.layers.4.self_attn_layer_norm.bias', 'model.encoder.layers.4.fc1.weight', 'model.encoder.layers.4.fc1.bias', 'model.encoder.layers.4.fc2.weight', 'model.encoder.layers.4.fc2.bias', 'model.encoder.layers.4.final_layer_norm.weight', 'model.encoder.layers.4.final_layer_norm.bias', 'model.encoder.layers.5.self_attn.k_proj.weight', 'model.encoder.layers.5.self_attn.v_proj.weight', 'model.encoder.layers.5.self_attn.v_proj.bias', 'model.encoder.layers.5.self_attn.q_proj.weight', 'model.encoder.layers.5.self_attn.q_proj.bias', 'model.encoder.layers.5.self_attn.out_proj.weight', 'model.encoder.layers.5.self_attn.out_proj.bias', 'model.encoder.layers.5.self_attn_layer_norm.weight', 'model.encoder.layers.5.self_attn_layer_norm.bias', 'model.encoder.layers.5.fc1.weight', 'model.encoder.layers.5.fc1.bias', 'model.encoder.layers.5.fc2.weight', 'model.encoder.layers.5.fc2.bias', 'model.encoder.layers.5.final_layer_norm.weight', 'model.encoder.layers.5.final_layer_norm.bias', 'model.encoder.layers.6.self_attn.k_proj.weight', 'model.encoder.layers.6.self_attn.v_proj.weight', 'model.encoder.layers.6.self_attn.v_proj.bias', 'model.encoder.layers.6.self_attn.q_proj.weight', 'model.encoder.layers.6.self_attn.q_proj.bias', 'model.encoder.layers.6.self_attn.out_proj.weight', 'model.encoder.layers.6.self_attn.out_proj.bias', 'model.encoder.layers.6.self_attn_layer_norm.weight', 'model.encoder.layers.6.self_attn_layer_norm.bias', 'model.encoder.layers.6.fc1.weight', 'model.encoder.layers.6.fc1.bias', 'model.encoder.layers.6.fc2.weight', 'model.encoder.layers.6.fc2.bias', 'model.encoder.layers.6.final_layer_norm.weight', 'model.encoder.layers.6.final_layer_norm.bias', 'model.encoder.layers.7.self_attn.k_proj.weight', 'model.encoder.layers.7.self_attn.v_proj.weight', 'model.encoder.layers.7.self_attn.v_proj.bias', 'model.encoder.layers.7.self_attn.q_proj.weight', 'model.encoder.layers.7.self_attn.q_proj.bias', 'model.encoder.layers.7.self_attn.out_proj.weight', 'model.encoder.layers.7.self_attn.out_proj.bias', 'model.encoder.layers.7.self_attn_layer_norm.weight', 'model.encoder.layers.7.self_attn_layer_norm.bias', 'model.encoder.layers.7.fc1.weight', 'model.encoder.layers.7.fc1.bias', 'model.encoder.layers.7.fc2.weight', 'model.encoder.layers.7.fc2.bias', 'model.encoder.layers.7.final_layer_norm.weight', 'model.encoder.layers.7.final_layer_norm.bias', 'model.encoder.layers.8.self_attn.k_proj.weight', 'model.encoder.layers.8.self_attn.v_proj.weight', 'model.encoder.layers.8.self_attn.v_proj.bias', 'model.encoder.layers.8.self_attn.q_proj.weight', 'model.encoder.layers.8.self_attn.q_proj.bias', 'model.encoder.layers.8.self_attn.out_proj.weight', 'model.encoder.layers.8.self_attn.out_proj.bias', 'model.encoder.layers.8.self_attn_layer_norm.weight', 'model.encoder.layers.8.self_attn_layer_norm.bias', 'model.encoder.layers.8.fc1.weight', 'model.encoder.layers.8.fc1.bias', 'model.encoder.layers.8.fc2.weight', 'model.encoder.layers.8.fc2.bias', 'model.encoder.layers.8.final_layer_norm.weight', 'model.encoder.layers.8.final_layer_norm.bias', 
'model.encoder.layers.9.self_attn.k_proj.weight', 'model.encoder.layers.9.self_attn.v_proj.weight', 'model.encoder.layers.9.self_attn.v_proj.bias', 'model.encoder.layers.9.self_attn.q_proj.weight', 'model.encoder.layers.9.self_attn.q_proj.bias', 'model.encoder.layers.9.self_attn.out_proj.weight', 'model.encoder.layers.9.self_attn.out_proj.bias', 'model.encoder.layers.9.self_attn_layer_norm.weight', 'model.encoder.layers.9.self_attn_layer_norm.bias', 'model.encoder.layers.9.fc1.weight', 'model.encoder.layers.9.fc1.bias', 'model.encoder.layers.9.fc2.weight', 'model.encoder.layers.9.fc2.bias', 'model.encoder.layers.9.final_layer_norm.weight', 'model.encoder.layers.9.final_layer_norm.bias', 'model.encoder.layers.10.self_attn.k_proj.weight', 'model.encoder.layers.10.self_attn.v_proj.weight', 'model.encoder.layers.10.self_attn.v_proj.bias', 'model.encoder.layers.10.self_attn.q_proj.weight', 'model.encoder.layers.10.self_attn.q_proj.bias', 'model.encoder.layers.10.self_attn.out_proj.weight', 'model.encoder.layers.10.self_attn.out_proj.bias', 'model.encoder.layers.10.self_attn_layer_norm.weight', 'model.encoder.layers.10.self_attn_layer_norm.bias', 'model.encoder.layers.10.fc1.weight', 'model.encoder.layers.10.fc1.bias', 'model.encoder.layers.10.fc2.weight', 'model.encoder.layers.10.fc2.bias', 'model.encoder.layers.10.final_layer_norm.weight', 'model.encoder.layers.10.final_layer_norm.bias', 'model.encoder.layers.11.self_attn.k_proj.weight', 'model.encoder.layers.11.self_attn.v_proj.weight', 'model.encoder.layers.11.self_attn.v_proj.bias', 'model.encoder.layers.11.self_attn.q_proj.weight', 'model.encoder.layers.11.self_attn.q_proj.bias', 'model.encoder.layers.11.self_attn.out_proj.weight', 'model.encoder.layers.11.self_attn.out_proj.bias', 'model.encoder.layers.11.self_attn_layer_norm.weight', 'model.encoder.layers.11.self_attn_layer_norm.bias', 'model.encoder.layers.11.fc1.weight', 'model.encoder.layers.11.fc1.bias', 'model.encoder.layers.11.fc2.weight', 'model.encoder.layers.11.fc2.bias', 'model.encoder.layers.11.final_layer_norm.weight', 'model.encoder.layers.11.final_layer_norm.bias', 'model.encoder.layers.12.self_attn.k_proj.weight', 'model.encoder.layers.12.self_attn.v_proj.weight', 'model.encoder.layers.12.self_attn.v_proj.bias', 'model.encoder.layers.12.self_attn.q_proj.weight', 'model.encoder.layers.12.self_attn.q_proj.bias', 'model.encoder.layers.12.self_attn.out_proj.weight', 'model.encoder.layers.12.self_attn.out_proj.bias', 'model.encoder.layers.12.self_attn_layer_norm.weight', 'model.encoder.layers.12.self_attn_layer_norm.bias', 'model.encoder.layers.12.fc1.weight', 'model.encoder.layers.12.fc1.bias', 'model.encoder.layers.12.fc2.weight', 'model.encoder.layers.12.fc2.bias', 'model.encoder.layers.12.final_layer_norm.weight', 'model.encoder.layers.12.final_layer_norm.bias', 'model.encoder.layers.13.self_attn.k_proj.weight', 'model.encoder.layers.13.self_attn.v_proj.weight', 'model.encoder.layers.13.self_attn.v_proj.bias', 'model.encoder.layers.13.self_attn.q_proj.weight', 'model.encoder.layers.13.self_attn.q_proj.bias', 'model.encoder.layers.13.self_attn.out_proj.weight', 'model.encoder.layers.13.self_attn.out_proj.bias', 'model.encoder.layers.13.self_attn_layer_norm.weight', 'model.encoder.layers.13.self_attn_layer_norm.bias', 'model.encoder.layers.13.fc1.weight', 'model.encoder.layers.13.fc1.bias', 'model.encoder.layers.13.fc2.weight', 'model.encoder.layers.13.fc2.bias', 'model.encoder.layers.13.final_layer_norm.weight', 'model.encoder.layers.13.final_layer_norm.bias', 
'model.encoder.layers.14.self_attn.k_proj.weight', 'model.encoder.layers.14.self_attn.v_proj.weight', 'model.encoder.layers.14.self_attn.v_proj.bias', 'model.encoder.layers.14.self_attn.q_proj.weight', 'model.encoder.layers.14.self_attn.q_proj.bias', 'model.encoder.layers.14.self_attn.out_proj.weight', 'model.encoder.layers.14.self_attn.out_proj.bias', 'model.encoder.layers.14.self_attn_layer_norm.weight', 'model.encoder.layers.14.self_attn_layer_norm.bias', 'model.encoder.layers.14.fc1.weight', 'model.encoder.layers.14.fc1.bias', 'model.encoder.layers.14.fc2.weight', 'model.encoder.layers.14.fc2.bias', 'model.encoder.layers.14.final_layer_norm.weight', 'model.encoder.layers.14.final_layer_norm.bias', 'model.encoder.layers.15.self_attn.k_proj.weight', 'model.encoder.layers.15.self_attn.v_proj.weight', 'model.encoder.layers.15.self_attn.v_proj.bias', 'model.encoder.layers.15.self_attn.q_proj.weight', 'model.encoder.layers.15.self_attn.q_proj.bias', 'model.encoder.layers.15.self_attn.out_proj.weight', 'model.encoder.layers.15.self_attn.out_proj.bias', 'model.encoder.layers.15.self_attn_layer_norm.weight', 'model.encoder.layers.15.self_attn_layer_norm.bias', 'model.encoder.layers.15.fc1.weight', 'model.encoder.layers.15.fc1.bias', 'model.encoder.layers.15.fc2.weight', 'model.encoder.layers.15.fc2.bias', 'model.encoder.layers.15.final_layer_norm.weight', 'model.encoder.layers.15.final_layer_norm.bias', 'model.encoder.layers.16.self_attn.k_proj.weight', 'model.encoder.layers.16.self_attn.v_proj.weight', 'model.encoder.layers.16.self_attn.v_proj.bias', 'model.encoder.layers.16.self_attn.q_proj.weight', 'model.encoder.layers.16.self_attn.q_proj.bias', 'model.encoder.layers.16.self_attn.out_proj.weight', 'model.encoder.layers.16.self_attn.out_proj.bias', 'model.encoder.layers.16.self_attn_layer_norm.weight', 'model.encoder.layers.16.self_attn_layer_norm.bias', 'model.encoder.layers.16.fc1.weight', 'model.encoder.layers.16.fc1.bias', 'model.encoder.layers.16.fc2.weight', 'model.encoder.layers.16.fc2.bias', 'model.encoder.layers.16.final_layer_norm.weight', 'model.encoder.layers.16.final_layer_norm.bias', 'model.encoder.layers.17.self_attn.k_proj.weight', 'model.encoder.layers.17.self_attn.v_proj.weight', 'model.encoder.layers.17.self_attn.v_proj.bias', 'model.encoder.layers.17.self_attn.q_proj.weight', 'model.encoder.layers.17.self_attn.q_proj.bias', 'model.encoder.layers.17.self_attn.out_proj.weight', 'model.encoder.layers.17.self_attn.out_proj.bias', 'model.encoder.layers.17.self_attn_layer_norm.weight', 'model.encoder.layers.17.self_attn_layer_norm.bias', 'model.encoder.layers.17.fc1.weight', 'model.encoder.layers.17.fc1.bias', 'model.encoder.layers.17.fc2.weight', 'model.encoder.layers.17.fc2.bias', 'model.encoder.layers.17.final_layer_norm.weight', 'model.encoder.layers.17.final_layer_norm.bias', 'model.encoder.layers.18.self_attn.k_proj.weight', 'model.encoder.layers.18.self_attn.v_proj.weight', 'model.encoder.layers.18.self_attn.v_proj.bias', 'model.encoder.layers.18.self_attn.q_proj.weight', 'model.encoder.layers.18.self_attn.q_proj.bias', 'model.encoder.layers.18.self_attn.out_proj.weight', 'model.encoder.layers.18.self_attn.out_proj.bias', 'model.encoder.layers.18.self_attn_layer_norm.weight', 'model.encoder.layers.18.self_attn_layer_norm.bias', 'model.encoder.layers.18.fc1.weight', 'model.encoder.layers.18.fc1.bias', 'model.encoder.layers.18.fc2.weight', 'model.encoder.layers.18.fc2.bias', 'model.encoder.layers.18.final_layer_norm.weight', 'model.encoder.layers.18.final_layer_norm.bias', 
'model.encoder.layers.19.self_attn.k_proj.weight', 'model.encoder.layers.19.self_attn.v_proj.weight', 'model.encoder.layers.19.self_attn.v_proj.bias', 'model.encoder.layers.19.self_attn.q_proj.weight', 'model.encoder.layers.19.self_attn.q_proj.bias', 'model.encoder.layers.19.self_attn.out_proj.weight', 'model.encoder.layers.19.self_attn.out_proj.bias', 'model.encoder.layers.19.self_attn_layer_norm.weight', 'model.encoder.layers.19.self_attn_layer_norm.bias', 'model.encoder.layers.19.fc1.weight', 'model.encoder.layers.19.fc1.bias', 'model.encoder.layers.19.fc2.weight', 'model.encoder.layers.19.fc2.bias', 'model.encoder.layers.19.final_layer_norm.weight', 'model.encoder.layers.19.final_layer_norm.bias', 'model.encoder.layers.20.self_attn.k_proj.weight', 'model.encoder.layers.20.self_attn.v_proj.weight', 'model.encoder.layers.20.self_attn.v_proj.bias', 'model.encoder.layers.20.self_attn.q_proj.weight', 'model.encoder.layers.20.self_attn.q_proj.bias', 'model.encoder.layers.20.self_attn.out_proj.weight', 'model.encoder.layers.20.self_attn.out_proj.bias', 'model.encoder.layers.20.self_attn_layer_norm.weight', 'model.encoder.layers.20.self_attn_layer_norm.bias', 'model.encoder.layers.20.fc1.weight', 'model.encoder.layers.20.fc1.bias', 'model.encoder.layers.20.fc2.weight', 'model.encoder.layers.20.fc2.bias', 'model.encoder.layers.20.final_layer_norm.weight', 'model.encoder.layers.20.final_layer_norm.bias', 'model.encoder.layers.21.self_attn.k_proj.weight', 'model.encoder.layers.21.self_attn.v_proj.weight', 'model.encoder.layers.21.self_attn.v_proj.bias', 'model.encoder.layers.21.self_attn.q_proj.weight', 'model.encoder.layers.21.self_attn.q_proj.bias', 'model.encoder.layers.21.self_attn.out_proj.weight', 'model.encoder.layers.21.self_attn.out_proj.bias', 'model.encoder.layers.21.self_attn_layer_norm.weight', 'model.encoder.layers.21.self_attn_layer_norm.bias', 'model.encoder.layers.21.fc1.weight', 'model.encoder.layers.21.fc1.bias', 'model.encoder.layers.21.fc2.weight', 'model.encoder.layers.21.fc2.bias', 'model.encoder.layers.21.final_layer_norm.weight', 'model.encoder.layers.21.final_layer_norm.bias', 'model.encoder.layers.22.self_attn.k_proj.weight', 'model.encoder.layers.22.self_attn.v_proj.weight', 'model.encoder.layers.22.self_attn.v_proj.bias', 'model.encoder.layers.22.self_attn.q_proj.weight', 'model.encoder.layers.22.self_attn.q_proj.bias', 'model.encoder.layers.22.self_attn.out_proj.weight', 'model.encoder.layers.22.self_attn.out_proj.bias', 'model.encoder.layers.22.self_attn_layer_norm.weight', 'model.encoder.layers.22.self_attn_layer_norm.bias', 'model.encoder.layers.22.fc1.weight', 'model.encoder.layers.22.fc1.bias', 'model.encoder.layers.22.fc2.weight', 'model.encoder.layers.22.fc2.bias', 'model.encoder.layers.22.final_layer_norm.weight', 'model.encoder.layers.22.final_layer_norm.bias', 'model.encoder.layers.23.self_attn.k_proj.weight', 'model.encoder.layers.23.self_attn.v_proj.weight', 'model.encoder.layers.23.self_attn.v_proj.bias', 'model.encoder.layers.23.self_attn.q_proj.weight', 'model.encoder.layers.23.self_attn.q_proj.bias', 'model.encoder.layers.23.self_attn.out_proj.weight', 'model.encoder.layers.23.self_attn.out_proj.bias', 'model.encoder.layers.23.self_attn_layer_norm.weight', 'model.encoder.layers.23.self_attn_layer_norm.bias', 'model.encoder.layers.23.fc1.weight', 'model.encoder.layers.23.fc1.bias', 'model.encoder.layers.23.fc2.weight', 'model.encoder.layers.23.fc2.bias', 'model.encoder.layers.23.final_layer_norm.weight', 'model.encoder.layers.23.final_layer_norm.bias', 
'model.encoder.layers.24.self_attn.k_proj.weight', 'model.encoder.layers.24.self_attn.v_proj.weight', 'model.encoder.layers.24.self_attn.v_proj.bias', 'model.encoder.layers.24.self_attn.q_proj.weight', 'model.encoder.layers.24.self_attn.q_proj.bias', 'model.encoder.layers.24.self_attn.out_proj.weight', 'model.encoder.layers.24.self_attn.out_proj.bias', 'model.encoder.layers.24.self_attn_layer_norm.weight', 'model.encoder.layers.24.self_attn_layer_norm.bias', 'model.encoder.layers.24.fc1.weight', 'model.encoder.layers.24.fc1.bias', 'model.encoder.layers.24.fc2.weight', 'model.encoder.layers.24.fc2.bias', 'model.encoder.layers.24.final_layer_norm.weight', 'model.encoder.layers.24.final_layer_norm.bias', 'model.encoder.layers.25.self_attn.k_proj.weight', 'model.encoder.layers.25.self_attn.v_proj.weight', 'model.encoder.layers.25.self_attn.v_proj.bias', 'model.encoder.layers.25.self_attn.q_proj.weight', 'model.encoder.layers.25.self_attn.q_proj.bias', 'model.encoder.layers.25.self_attn.out_proj.weight', 'model.encoder.layers.25.self_attn.out_proj.bias', 'model.encoder.layers.25.self_attn_layer_norm.weight', 'model.encoder.layers.25.self_attn_layer_norm.bias', 'model.encoder.layers.25.fc1.weight', 'model.encoder.layers.25.fc1.bias', 'model.encoder.layers.25.fc2.weight', 'model.encoder.layers.25.fc2.bias', 'model.encoder.layers.25.final_layer_norm.weight', 'model.encoder.layers.25.final_layer_norm.bias', 'model.encoder.layers.26.self_attn.k_proj.weight', 'model.encoder.layers.26.self_attn.v_proj.weight', 'model.encoder.layers.26.self_attn.v_proj.bias', 'model.encoder.layers.26.self_attn.q_proj.weight', 'model.encoder.layers.26.self_attn.q_proj.bias', 'model.encoder.layers.26.self_attn.out_proj.weight', 'model.encoder.layers.26.self_attn.out_proj.bias', 'model.encoder.layers.26.self_attn_layer_norm.weight', 'model.encoder.layers.26.self_attn_layer_norm.bias', 'model.encoder.layers.26.fc1.weight', 'model.encoder.layers.26.fc1.bias', 'model.encoder.layers.26.fc2.weight', 'model.encoder.layers.26.fc2.bias', 'model.encoder.layers.26.final_layer_norm.weight', 'model.encoder.layers.26.final_layer_norm.bias', 'model.encoder.layers.27.self_attn.k_proj.weight', 'model.encoder.layers.27.self_attn.v_proj.weight', 'model.encoder.layers.27.self_attn.v_proj.bias', 'model.encoder.layers.27.self_attn.q_proj.weight', 'model.encoder.layers.27.self_attn.q_proj.bias', 'model.encoder.layers.27.self_attn.out_proj.weight', 'model.encoder.layers.27.self_attn.out_proj.bias', 'model.encoder.layers.27.self_attn_layer_norm.weight', 'model.encoder.layers.27.self_attn_layer_norm.bias', 'model.encoder.layers.27.fc1.weight', 'model.encoder.layers.27.fc1.bias', 'model.encoder.layers.27.fc2.weight', 'model.encoder.layers.27.fc2.bias', 'model.encoder.layers.27.final_layer_norm.weight', 'model.encoder.layers.27.final_layer_norm.bias', 'model.encoder.layers.28.self_attn.k_proj.weight', 'model.encoder.layers.28.self_attn.v_proj.weight', 'model.encoder.layers.28.self_attn.v_proj.bias', 'model.encoder.layers.28.self_attn.q_proj.weight', 'model.encoder.layers.28.self_attn.q_proj.bias', 'model.encoder.layers.28.self_attn.out_proj.weight', 'model.encoder.layers.28.self_attn.out_proj.bias', 'model.encoder.layers.28.self_attn_layer_norm.weight', 'model.encoder.layers.28.self_attn_layer_norm.bias', 'model.encoder.layers.28.fc1.weight', 'model.encoder.layers.28.fc1.bias', 'model.encoder.layers.28.fc2.weight', 'model.encoder.layers.28.fc2.bias', 'model.encoder.layers.28.final_layer_norm.weight', 'model.encoder.layers.28.final_layer_norm.bias', 
'model.encoder.layers.29.self_attn.k_proj.weight', 'model.encoder.layers.29.self_attn.v_proj.weight', 'model.encoder.layers.29.self_attn.v_proj.bias', 'model.encoder.layers.29.self_attn.q_proj.weight', 'model.encoder.layers.29.self_attn.q_proj.bias', 'model.encoder.layers.29.self_attn.out_proj.weight', 'model.encoder.layers.29.self_attn.out_proj.bias', 'model.encoder.layers.29.self_attn_layer_norm.weight', 'model.encoder.layers.29.self_attn_layer_norm.bias', 'model.encoder.layers.29.fc1.weight', 'model.encoder.layers.29.fc1.bias', 'model.encoder.layers.29.fc2.weight', 'model.encoder.layers.29.fc2.bias', 'model.encoder.layers.29.final_layer_norm.weight', 'model.encoder.layers.29.final_layer_norm.bias', 'model.encoder.layers.30.self_attn.k_proj.weight', 'model.encoder.layers.30.self_attn.v_proj.weight', 'model.encoder.layers.30.self_attn.v_proj.bias', 'model.encoder.layers.30.self_attn.q_proj.weight', 'model.encoder.layers.30.self_attn.q_proj.bias', 'model.encoder.layers.30.self_attn.out_proj.weight', 'model.encoder.layers.30.self_attn.out_proj.bias', 'model.encoder.layers.30.self_attn_layer_norm.weight', 'model.encoder.layers.30.self_attn_layer_norm.bias', 'model.encoder.layers.30.fc1.weight', 'model.encoder.layers.30.fc1.bias', 'model.encoder.layers.30.fc2.weight', 'model.encoder.layers.30.fc2.bias', 'model.encoder.layers.30.final_layer_norm.weight', 'model.encoder.layers.30.final_layer_norm.bias', 'model.encoder.layers.31.self_attn.k_proj.weight', 'model.encoder.layers.31.self_attn.v_proj.weight', 'model.encoder.layers.31.self_attn.v_proj.bias', 'model.encoder.layers.31.self_attn.q_proj.weight', 'model.encoder.layers.31.self_attn.q_proj.bias', 'model.encoder.layers.31.self_attn.out_proj.weight', 'model.encoder.layers.31.self_attn.out_proj.bias', 'model.encoder.layers.31.self_attn_layer_norm.weight', 'model.encoder.layers.31.self_attn_layer_norm.bias', 'model.encoder.layers.31.fc1.weight', 'model.encoder.layers.31.fc1.bias', 'model.encoder.layers.31.fc2.weight', 'model.encoder.layers.31.fc2.bias', 'model.encoder.layers.31.final_layer_norm.weight', 'model.encoder.layers.31.final_layer_norm.bias']

[Question] Mixed Speech transcription

Is it possible to fine-tune Whisper/Distil-Whisper for mixed-speech (code-switched) transcription, like Hindi+English within a single sentence, which is common in casual conversation? Has anyone tried this before? Would training on a mixture of Hindi and English datasets work?
I recently used a fine-tuned Whisper model for ASR, and it ended up hallucinating and adding extra text, which I haven't been able to fix yet.
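One way to build such a training mixture (not something that has been validated here) is to interleave a Hindi corpus with an English one using 🤗 Datasets. A minimal sketch, where the choice of Common Voice 13 and the 50/50 mixing ratio are arbitrary assumptions:

from datasets import Audio, interleave_datasets, load_dataset

# Hypothetical corpora: Common Voice 13 Hindi and English train splits, streamed to avoid a full download.
hindi = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train", streaming=True)
english = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="train", streaming=True)

# Resample both to 16 kHz, the sampling rate the Whisper feature extractor expects.
hindi = hindi.cast_column("audio", Audio(sampling_rate=16_000))
english = english.cast_column("audio", Audio(sampling_rate=16_000))

# Alternate between the two languages so every training batch contains both.
mixed = interleave_datasets([hindi, english], probabilities=[0.5, 0.5], seed=42)

Note that this only mixes monolingual sentences; genuinely code-switched speech (Hindi and English inside one sentence) would most likely need a corpus that actually contains code-switching.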

distil-whisper doesn't work as a drop-in replacement for whisper

If one of the goals of distil-whisper is to be a drop-in replacement for the Whisper models, it would be interesting to be able to cast it to an object of type whisper.Whisper, so that it could be used with any custom decoder implemented for Whisper.

Practical issue

I faced a few problems when trying to use the model with stable_whisper.

from transformers import WhisperForConditionalGeneration
import whisper
import stable_whisper
pt = WhisperForConditionalGeneration.from_pretrained('distil-whisper/distil-medium.en')
stable_whisper.modify_model(pt.model)
audio = whisper.load_audio('sample.mp3')
pt.model.transcribe(audio)

The first issue is that the Transformers model doesn't have the dims and is_multilingual properties, so I set them manually:

pt.model.dims = whisper.load_model('large-v2').dims
pt.model.is_multilingual = False

Even with those set, transcription fails with AttributeError: 'BaseModelOutput' object has no attribute 'dtype'.

Next, I tried loading the state_dict into an openai-whisper model, but that doesn't work either:

	Missing key(s) in state_dict: "encoder.positional_embedding", "encoder.blocks.0.attn.query.weight", "encoder.blocks.0.attn.query.bias", "encoder.blocks.0.attn.key.weight", ...
	Unexpected key(s) in state_dict: "encoder.embed_positions.weight", "encoder.layers.0.self_attn.k_proj.weight", "encoder.layers.0.self_attn.v_proj.weight", ...

In summary

What would it take to cast the distilled model to a whisper.Whisper so that it can be a drop-in alternative for a broader set of applications?
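For anyone who wants to experiment, the missing/unexpected key lists above suggest the two formats mostly differ in parameter naming (plus the distilled checkpoints having only two decoder layers). Below is an untested sketch of renaming the Hugging Face keys into the openai-whisper layout; the substitutions are assumptions inferred from those key names, not an official conversion:

import re

from transformers import WhisperForConditionalGeneration

# Substitutions inferred from the key names in the error output above (assumptions, applied in order).
RENAMES = [
    (r"^model\.", ""),                                     # drop the HF "model." prefix
    (r"\.layers\.", ".blocks."),                           # per-layer containers
    (r"\.self_attn\.q_proj\.", ".attn.query."),
    (r"\.self_attn\.k_proj\.", ".attn.key."),
    (r"\.self_attn\.v_proj\.", ".attn.value."),
    (r"\.self_attn\.out_proj\.", ".attn.out."),
    (r"\.self_attn_layer_norm\.", ".attn_ln."),
    (r"\.encoder_attn\.q_proj\.", ".cross_attn.query."),
    (r"\.encoder_attn\.k_proj\.", ".cross_attn.key."),
    (r"\.encoder_attn\.v_proj\.", ".cross_attn.value."),
    (r"\.encoder_attn\.out_proj\.", ".cross_attn.out."),
    (r"\.encoder_attn_layer_norm\.", ".cross_attn_ln."),
    (r"\.fc1\.", ".mlp.0."),
    (r"\.fc2\.", ".mlp.2."),
    (r"\.final_layer_norm\.", ".mlp_ln."),
    (r"^encoder\.embed_positions\.weight$", "encoder.positional_embedding"),
    (r"^decoder\.embed_positions\.weight$", "decoder.positional_embedding"),
    (r"^decoder\.embed_tokens\.", "decoder.token_embedding."),
    (r"^encoder\.layer_norm\.", "encoder.ln_post."),
    (r"^decoder\.layer_norm\.", "decoder.ln."),
]

def to_openai_keys(hf_state_dict):
    """Best-effort rename of HF Whisper parameter names into the openai-whisper layout."""
    renamed = {}
    for key, tensor in hf_state_dict.items():
        if key == "proj_out.weight":
            continue  # the output projection is tied to the token embedding in openai-whisper
        for pattern, repl in RENAMES:
            key = re.sub(pattern, repl, key)
        renamed[key] = tensor
    return renamed

hf_model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-medium.en")
state_dict = to_openai_keys(hf_model.state_dict())
# Note: the target whisper.ModelDimensions must also reflect the distilled architecture
# (e.g. n_text_layer=2), so reusing the dims of an official checkpoint, as done above, will not line up.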

Compatibility with CTranslate2 / faster-whisper

Great work!

I was wondering whether the distilled version might still be compatible with CTranslate2 / faster-whisper? I understand the changes to the decoder might require some work there, not to mention speculative decoding.

Thanks,
Ewald

[Speculative Decoding] How to run speculative decoding for batch_size > 1?

Transformers 4.35 only supports speculative decoding for batch size == 1. In order to use speculative decoding for batch size > 1, please make sure to use this branch: huggingface/transformers#26875

To do so, you need to install transformers as follows:

pip install git+https://github.com/huggingface/transformers.git@assistant_decoding_batch

and then you can run:

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    chunk_length_s=15,
    batch_size=4,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "default", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

The PR will be merged into Transformers soon.

Note: given the "speculative" nature of assistant decoding (a.k.a. speculative decoding), it is not recommended to use it with batch sizes higher than 4, as this can actually make the transcription pipeline slower than just using the teacher model.
See Table 22 of the paper.

Tiny model?

Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would go, as I'd like to use it on phones.

Can it be fine-tuned for a phoneme task?

Has anyone experimented with fine-tuning for phoneme recognition (English)? If so, please share some of your experiments.
Many thanks!

English only for large-v2?

Wondering if the statement in the README is correct: "drop-in replacement for Whisper on English speech recognition". Does this mean that even the large-v2-based model is English-only? Thanks!

High inference time when using chunk size 15

Hi @sanchit-gandhi !

I'm in the process of integrating multiple whisper backends into a unified package that includes VAD-based chunking. During testing, I observed significantly higher inference times while using the HuggingFace pipeline with distil-whisper. You can find the details here: https://github.com/shashikg/WhisperS2T/releases/tag/v1.1.0 [A30 GPU]

Could you please review the benchmarking script I'm using? It's available at: https://github.com/shashikg/WhisperS2T/blob/main/scripts/benchmark_huggingface_distil.py

Thanks for your assistance!

Shashi

Beam size > 1

How should decoding be handled when the beam size is greater than 1?
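With the Transformers code path, beam search can be enabled by passing num_beams through generate_kwargs. A minimal sketch reusing the model, processor, torch_dtype, device, and sample objects from the short-form example above (num_beams=5 is an arbitrary choice):

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"num_beams": 5},  # beam search instead of greedy decoding
)

result = pipe(sample)
print(result["text"])

Note that assisted (speculative) decoding in Transformers currently requires num_beams == 1, so beam search cannot be combined with an assistant model.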

Long-Form transcription with Faster Whisper

Hi, I have been working with faster-whisper and trying to use the distil-whisper model. However, distil-whisper works on 30-second audio chunks, and using it with faster-whisper only outputs the first 30 seconds.

How can it be used with the faster-whisper implementation?
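For reference, the distil-large-v3 checkpoint was trained with the sequential long-form algorithm used by OpenAI Whisper and faster-whisper in mind, so it is the distilled checkpoint to reach for here. A minimal sketch, assuming a faster-whisper version recent enough to resolve the distil-large-v3 model name (otherwise a local path to a CTranslate2 conversion of the checkpoint can be passed instead):

from faster_whisper import WhisperModel

# Load the distilled checkpoint (the model name resolution is an assumption; pass a local
# CTranslate2 conversion of the checkpoint if your faster-whisper version does not know it).
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# condition_on_previous_text=False is recommended for the distilled checkpoints.
segments, info = model.transcribe("long_audio.mp3", language="en", condition_on_previous_text=False)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

distil-large-v2 was not trained for this sequential algorithm, which is likely why earlier checkpoints struggle past the first 30-second window.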

Export subtitle formats such as VTT

whisper supports outputting multiple file formats (currently txt, vtt, srt, tsv, json, all). Does this project have the ability to output multiple formats? Here is the whisper command-line help; you can specify the output format with --output_format (a do-it-yourself VTT sketch follows the help text below).

whisper --help
usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}]
               [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE] [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patience PATIENCE]
               [--length_penalty LENGTH_PENALTY] [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]
               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD] [--logprob_threshold LOGPROB_THRESHOLD]
               [--no_speech_threshold NO_SPEECH_THRESHOLD] [--word_timestamps WORD_TIMESTAMPS]
               [--prepend_punctuations PREPEND_PUNCTUATIONS] [--append_punctuations APPEND_PUNCTUATIONS]
               [--highlight_words HIGHLIGHT_WORDS] [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
               [--threads THREADS]
               audio [audio ...]
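There is no built-in subtitle exporter in this project, but the Transformers pipeline can return segment-level timestamps, which are straightforward to format as VTT yourself. A minimal sketch reusing the pipe and sample objects from earlier; the to_vtt_timestamp helper is hypothetical, not part of any library:

result = pipe(sample, return_timestamps=True)

def to_vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

with open("transcript.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        end = end if end is not None else start  # the final chunk can have an open-ended timestamp
        f.write(f"{to_vtt_timestamp(start)} --> {to_vtt_timestamp(end)}\n")
        f.write(chunk["text"].strip() + "\n\n")

The same loop can emit SRT by numbering the cues and switching the millisecond separator to a comma.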

Missing config.json file

I am trying to run the following code:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)

But I am getting the following error:

OSError: distil-whisper/distil-large-v2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/distil-whisper/distil-large-v2/main' for available files.

Am I doing something wrong?

By the way, the distil-whisper instructions say that Distil-Whisper is supported in Hugging Face Transformers from version 4.35 onwards, but I wasn't able to find that version on PyPI: https://pypi.org/project/transformers/#history

[For your information] Run onnx models of distil-whisper with sherpa-onnx

FYI: we have added support for exporting distil-whisper to ONNX and running it with sherpa-onnx.

You can find a Colab notebook for illustration (linked via the "Open In Colab" badge).

sherpa-onnx is implemented in C++ and provides APIs for various languages, e.g., Python, C#, Go, C, Kotlin, and Swift.
It supports Windows/Linux/macOS and Android/iOS/Raspberry Pi, etc.

The current medium model is still very large and its RTF is greater than 1 on CPU.

Hope that tiny/base/small models will be available soon.

Streaming Implementation of Distil Whisper

Is it possible to use Distil Whisper for streaming applications?

Can I send small chunks of audio to a server for processing?

Use case: I want to use my mic to send small chunks of audio to a server and have them processed quickly so I get the transcripts back.
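Distil-Whisper is not a streaming model, but short chunks can be transcribed as they arrive: the Transformers pipeline accepts raw arrays together with their sampling rate. A minimal server-side sketch reusing the pipe object from earlier, assuming mono float32 chunks at 16 kHz arrive from the client (the transport layer is left out):

import numpy as np

def transcribe_chunk(chunk: np.ndarray) -> str:
    # `chunk` is assumed to be mono float32 PCM at 16 kHz, a few seconds long.
    result = pipe({"raw": chunk, "sampling_rate": 16_000})
    return result["text"]

# Stand-in for a chunk of microphone audio: five seconds of silence.
dummy_chunk = np.zeros(5 * 16_000, dtype=np.float32)
print(transcribe_chunk(dummy_chunk))

Transcribing fixed chunks independently loses context at the boundaries, so real low-latency streaming usually adds overlapping windows or a VAD to cut at pauses.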
