
Distil-Whisper

[Paper] [Models] [Colab] [Training Code]

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets:

Model              Params / M   Rel. Latency ↑   Short-Form WER ↓   Long-Form WER ↓
large-v3           1550         1.0              8.4                11.0
distil-large-v3    756          6.3              9.7                10.8
distil-large-v2    756          5.8              10.1               11.6
distil-medium.en   394          6.8              11.1               12.4
distil-small.en    166          5.6              12.1               12.8

For most applications, we recommend the latest distil-large-v3 checkpoint, since it is the most performant distilled checkpoint and is compatible across all Whisper libraries. The only exception is resource-constrained applications with very little memory, such as on-device or mobile applications, where distil-small.en is a great choice, since it is only 166M parameters and performs within 4% WER of Whisper large-v3.

Note: Distil-Whisper is currently only available for English speech recognition. We are working with the community to distill Whisper in other languages. If you are interested in distilling Whisper in your language, check out the provided training code. We will update the repository with multilingual checkpoints when they are ready!

1. Usage

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]

Short-Form Transcription

Short-form transcription is the process of transcribing audio samples that are less than 30 seconds long, which is the maximum receptive field of the Whisper models. This means the entire audio clip can be processed in one go without the need for chunking.

First, we load Distil-Whisper via the convenient AutoModelForSpeechSeq2Seq and AutoProcessor classes.

We load the model in float16 precision and minimise the loading time by passing low_cpu_mem_usage=True. In addition, we ensure that the model is loaded in the safetensors format by passing use_safetensors=True:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

The model and processor can then be passed to the pipeline. Note that if you would like to have more control over the generation process, you can directly make use of the model + processor API, as shown below.

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Next, we load an example short-form audio from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")
print(result["text"])

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline docs. We also provide an end-to-end Google Colab that benchmarks Whisper against Distil-Whisper.

For more control over the generation parameters, use the model + processor API directly:

Ad-hoc generation arguments can be passed to model.generate, including num_beams for beam-search, return_timestamps for segment-level timestamps, and prompt_ids for prompting. See the docstrings for more details.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

input_features = input_features.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}

pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])

print(pred_text)
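
As an illustration, to additionally return segment-level timestamps with this API, one could enable return_timestamps in both generation and decoding. This is a small variation on the snippet above, shown as a sketch rather than an official example:

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": True,  # ask Whisper to predict <|x.xx|> timestamp tokens
}

pred_ids = model.generate(input_features, **gen_kwargs)
# decode_with_timestamps keeps the timestamp annotations in the decoded text
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=True)

print(pred_text)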

Sequential Long-Form

The latest distil-large-v3 checkpoint is specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30 seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.

The sequential long-form algorithm should be used in either of the following scenarios:

  1. Transcription accuracy is the most important factor, and latency is less of a consideration
  2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Section 5 of the Distil-Whisper paper.

We start by loading the model and processor as before:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

The model and processor can then be passed to the pipeline. Note that if you would like to have more control over the generation process, you can directly make use of the model.generate(...) API, as shown below.

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

Next, we load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")
print(result["text"])

For more control over the generation parameters, use the model + processor API directly:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)

Chunked Long-Form

distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the Distil-Whisper paper).

We can load the model and processor as before:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

To enable chunking, pass the chunk_length_s parameter to the pipeline. For distil-large-v3, a chunk length of 25 seconds is optimal. To activate batching, pass the argument batch_size:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

The argument max_new_tokens controls the maximum number of generated tokens per chunk. In typical speech, there are no more than about 3 words spoken per second, so a 30-second input contains at most roughly 90 words (approximately 128 tokens). We therefore cap the number of generated tokens per chunk at 128 to truncate any hallucinations that occur at the end of a segment. Since these tokens would be removed at the chunk borders by the long-form chunking transcription algorithm anyway, it is more efficient to truncate them directly during generation and avoid redundant decoder steps.

Now, let's load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:

from datasets import load_dataset

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

Finally, we can call the pipeline to transcribe the audio:

result = pipe(sample)
print(result["text"])

For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline docs.

Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.

For speculative decoding, we need to load both the teacher model, openai/whisper-large-v3, and the assistant (a.k.a. student) model, distil-whisper/distil-large-v3.

Let's start by loading the teacher model and processor. We do this in much the same way we loaded the Distil-Whisper model in the previous examples:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Now let's load the assistant. Since Distil-Whisper shares exactly the same encoder as the teacher model, we only need to load its 2-layer decoder as a "decoder-only" model:

from transformers import AutoModelForCausalLM
assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

The assistant model shares the same processor as the teacher, so there's no need to load a student processor.

We can now pass the assistant model to the pipeline to be used for speculative decoding. We pass it as a generate_kwarg with the key "assistant_model" so that speculative decoding is enabled:

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

As before, we can pass any sample to the pipeline to be transcribed:

from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Note: speculative decoding should be on average 2x faster than using Whisper large-v3 alone, at a mere 8% increase in VRAM usage, while mathematically ensuring the same results. This makes it the perfect replacement for Whisper large-v3 in existing speech recognition pipelines.

For more details on speculative decoding, refer to the Distil-Whisper paper.

Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Distil-Whisper which we cover in the following.

Flash Attention

We recommend using Flash Attention 2 if your GPU allows for it. To do so, you first need to install Flash Attention:

pip install flash-attn --no-build-isolation

You can then pass use_flash_attention_2=True to from_pretrained to enable Flash Attention 2:

- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of BetterTransformer. To do so, you first need to install optimum:

pip install --upgrade optimum

And then convert your model to a "BetterTransformer" model before using it:

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()
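
Alternatively, on more recent versions of Transformers (v4.36 and later), SDPA attention can be requested directly when loading the model, without going through optimum. A minimal sketch, assuming such a Transformers version is installed:

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # use PyTorch scaled dot-product attention kernels
)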

Exporting to Other Libraries

Distil-Whisper (distil-small.en, distil-medium.en and distil-large-v2) is supported in the following libraries with the original "sequential" long-form transcription algorithm:

  - OpenAI Whisper
  - whisper.cpp
  - Transformers.js
  - Candle (Rust)

Updates will be posted here with the integration of the "chunked" long-form transcription algorithm into the respective libraries.

For the 🤗 Transformers code-examples, refer to the sections Short-Form and Long-Form Transcription.

2. Why use Distil-Whisper? ⁉️

Distil-Whisper is designed to be a drop-in replacement for Whisper on English speech recognition. Here are 5 reasons for making the switch to Distil-Whisper:

  1. Faster inference: 6 times faster inference speed, while performing within 1% WER of Whisper on out-of-distribution audio.

  2. Robustness to noise: demonstrated by strong WER performance at low signal-to-noise ratios.

  3. Robustness to hallucinations: quantified by 1.3 times fewer repeated 5-gram word duplicates (5-Dup.) and a 2.1% lower insertion error rate (IER) than Whisper.

  4. Designed for speculative decoding: Distil-Whisper can be used as an assistant model to Whisper, giving 2 times faster inference speed while mathematically ensuring the same outputs as the Whisper model.
  5. Permissive license: Distil-Whisper is MIT licensed, meaning it can be used for commercial applications.

3. Approach ✍️

To distill Whisper, we copy the entire encoder module and freeze it during training. We copy only two decoder layers, which are initialised from the first and last decoder layers of Whisper. All other decoder layers are discarded.
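
The following is a minimal sketch of that initialisation in 🤗 Transformers. It is an illustration of the idea only, not the repository's create_student_model.py script, and it omits details such as copying the decoder embeddings and final layer norm:

import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Student config: identical to the teacher, but with only 2 decoder layers.
student_config = copy.deepcopy(teacher.config)
student_config.decoder_layers = 2
student = WhisperForConditionalGeneration(student_config)

# Copy the entire encoder and freeze it for training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Initialise the 2 student decoder layers from the first and last teacher decoder layers.
student.model.decoder.layers[0].load_state_dict(teacher.model.decoder.layers[0].state_dict())
student.model.decoder.layers[1].load_state_dict(teacher.model.decoder.layers[-1].state_dict())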

Distil-Whisper is trained on a knowledge distillation objective. Specifically, it is trained to minimise the KL divergence between the distilled model and the Whisper model, as well as the cross-entropy loss on pseudo-labelled audio data.
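
A rough sketch of that objective is shown below. The temperature and loss weights are assumptions for illustration; the actual training script handles label padding, weighting and other details differently:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, kl_weight=0.8, ce_weight=1.0):
    # KL divergence between the student and (frozen) teacher token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Cross-entropy of the student against the pseudo-labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    return kl_weight * kl + ce_weight * ce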

We train Distil-Whisper on a total of 22k hours of pseudo-labelled audio data, spanning 10 domains with over 18k speakers.

This diverse audio dataset is paramount to ensuring robustness of Distil-Whisper to different datasets and domains.

In addition, we use a WER filter to discard pseudo-labels where Whisper mis-transcribes or hallucinates. This greatly improves WER performance of the downstream distilled model.
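
A sketch of such a filter is shown below, using the jiwer package to compute the WER between the ground-truth transcript and the Whisper pseudo-label; the 10% threshold is an assumption for illustration:

import jiwer

def keep_pseudo_label(reference: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    # Discard training samples where the pseudo-label deviates too much from the reference.
    return jiwer.wer(reference, pseudo_label) <= max_wer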

For full details on the distillation set-up and evaluation results, refer to the Distil-Whisper paper.

4. Training Code

Training code to reproduce Distil-Whisper can be found in the directory training. This code has been adapted to be general enough to distill Whisper for multilingual speech recognition, enabling anyone in the community to distill Whisper on their language of choice.

5. Acknowledgements

6. Citation

If you use this model, please consider citing the Distil-Whisper paper:

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

And also the Whisper paper:

@misc{radford2022robust,
      title={Robust Speech Recognition via Large-Scale Weak Supervision}, 
      author={Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
      year={2022},
      eprint={2212.04356},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

distil-whisper's People

Contributors

amrrs, bofenghuang, eltociear, eustlb, ibrahimamin1, patrickvonplaten, sanchit-gandhi


distil-whisper's Issues

Running medium.en model on Jetson Xavier

Hi! I was running the Colab code on a Jetson Xavier platform with CUDA 10.8 and a custom-compiled torch 1.8. We can't update JetPack/CUDA right now due to other limitations, but I managed to run the model on the GPU. However, I got these transcriptions:

Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.26k/2.26k [00:00<00:00, 3.27MB/s]
Downloading model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 789M/789M [01:16<00:00, 10.2MB/s]
Downloading (…)neration_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<00:00, 1.12MB/s]
  0%|                                                                                                                                                                    | 0/73 [00:00<?, ?it/s]['AN OKout.. S s카� I s мен reckless gly Hretsor�� enosaurs.']
  1%|██▏                                                                                                                                                         | 1/73 [00:02<03:23,  2.83s/it][' sharing SAN. OKout..el었gram 14ire en mas.']
  3%|████▎                                                                                                                                                       | 2/73 [00:04<02:53,  2.44s/it][' difspakacally yourcompl line I s ad,if describe gче dramaticíp�ak,ict privamin O fra g That douurityys Cabor sщ.']
  4%|██████▍                                                                                                                                                     | 3/73 [00:07<02:46,  2.38s/it][' difist TaiwanSOno toward únicoio Horrorel over S Anaderavort, g G tutorial lroink maybe I drin T And rebell.']
  5%|████████▌                                                                                                                                                   | 4/73 [00:10<02:56,  2.56s/it][' Jetzt chel symbol H a certainly I age تو g Germany nickname, gượcel psychiatric apprentice Hith Yourith aand pairing Barcelona.AN.アk Next SCPel ropes presentedallyies most l s give YeacAN. You beenouöor vida enаст.odyAN.ruct academic love sell en vo.. a treball ہے re sely,�ch exam, very a consequently can l a Ehkl, ver pr,']
  7%|██████████▋                                                                                                                                                 | 5/73 [00:13<03:25,  3.02s/it][' had S neck="osakorerm "ittle accelerated going new sacrifices H,ittle основ l ble.']
  8%|████████████▊                                                                                                                                               | 6/73 [00:16<03:12,  2.87s/it]['� sentlyéd I getting,AN. OKout.. Russiaif helpful elastic HTML.']
 10%|██████████████▉                                                                                                                                             | 7/73 [00:18<02:48,  2.56s/it][' aproximadamentechspak S I a polximoramas gaks l getting S faz unings.']
 11%|█████████████████                                                                                                                                           | 8/73 [00:20<02:42,  2.50s/it][' placeosivingоia, out H Ixt clients, countries gد.']
 12%|███████████████████▏                                                                                                                                        | 9/73 [00:22<02:29,  2.34s/it][' dif that saladys unseen s Dongacist manurn sur баз getting g partlyiqu great symbol,�� aOver Everyoneor s long Jun g sulakac l s gen� I getting, dressed大家都ill s�� Ear here Worldus Police.']
 14%|█████████████████████▏                                                                                                                                     | 10/73 [00:25<02:40,  2.55s/it]['� s wond, g s can Max got accurate obswh Oس re s examples base.']
 15%|███████████████████████▎                                                                                                                                   | 11/73 [00:28<02:32,  2.46s/it]['oundings,ch Sو Aut reANП Michaelos So Leaderac ahow may 안돼 s gift pros I pr, g lidif Liebe meditation l •ät going controversac many怕 I раз']
 16%|█████████████████████████▍                                                                                                                                 | 12/73 [00:30<02:35,  2.54s/it][" 되,idi, enell over prob bitase '."]
 18%|███████████████████████████▌                                                                                                                               | 13/73 [00:33<02:27,  2.47s/it]['AN. OKout..istły en beginning,oschist secretossor onlyки saters engine I recy.']
 19%|█████████████████████████████▋                                                                                                                             | 14/73 [00:35<02:18,  2.35s/it][' con typically OKout.., for.A.']
 21%|███████████████████████████████▊                                                                                                                           | 15/73 [00:37<02:18,  2.38s/it][' currentlyadck desert opt I éléments, s Ukraine Useilanist project a oaby、 a Dark, к partyert B000 forgivefer aations degradation похож.']
 22%|█████████████████████████████████▉                                                                                                                         | 16/73 [00:40<02:23,  2.53s/it][" difistRA gRAos ', Mur 장bero, J then muror conseguirang s 15jin sdoor g theninate s 솔직if most plan."]
 23%|████████████████████████████████████                                                                                                                       | 17/73 [00:42<02:16,  2.43s/it][' Tome organizations a vara Vcess T collaborativeor isies.odyif yourch hasta near g swing s invade哦ith achieith also out then man twelve.']
 25%|██████████████████████████████████████▏                                                                                                                    | 18/73 [00:45<02:26,  2.66s/it][' C maybe imag then man creators,ink�ino surital g esc sley.']
 26%|████████████████████████████████████████▎                                                                                                                  | 19/73 [00:48<02:15,  2.50s/it][' C Direist surely aBy g really / Hcketosad.']
 27%|██████████████████████████████████████████▍                                                                                                                | 20/73 [00:50<02:05,  2.36s/it][' T ostzione��ittle rempeciallyor style tw conf,inkch people So aboutcause.']
 29%|████████████████████████████████████████████▌                                                                                                              | 21/73 [00:52<01:56,  2.24s/it][' T la下or Go really optionsor shoes,inkch people So It.']
 30%|██████████████████████████████████████████████▋                                                                                                            | 22/73 [00:53<01:47,  2.11s/it][' dif copied g 나�fore contributed, contrast somet Dire.']
 32%|████████████████████████████████████████████████▊                                                                                                          | 23/73 [00:56<01:49,  2.20s/it][' Tiritch уri overş��, �ert B000.']
 33%|██████████████████████████████████████████████████▉                                                                                                        | 24/73 [00:58<01:49,  2.23s/it][' dif уri overallyort.']
 34%|█████████████████████████████████████████████████████                                                                                                      | 25/73 [01:00<01:42,  2.14s/it]['oundings,ferel figch G about l newwordityithadyith und Cel knew, hab By Hcause genacro Clarkakor sit yeort nuclear.']
 36%|███████████████████████████████████████████████████████▏                                                                                                   | 26/73 [01:03<01:50,  2.35s/it]['row30,lyien� First culture.']
 37%|█████████████████████████████████████████████████████████▎                                                                                                 | 27/73 [01:05<01:40,  2.18s/it]['謝 Sther optionsting?']
 38%|███████████████████████████████████████████████████████████▍                                                                                               | 28/73 [01:07<01:39,  2.22s/it][' 무서ert B000 l samanке.']
 40%|█████████████████████████████████████████████████████████████▌                                                                                             | 29/73 [01:09<01:32,  2.10s/it]['謝 Sac?']
 41%|███████████████████████████████████████████████████████████████▋                                                                                           | 30/73 [01:10<01:22,  1.93s/it][' Cariке S l s gen기 h Georget, s Make lort undwordity, contrast simple culture.']
 42%|█████████████████████████████████████████████████████████████████▊                                                                                         | 31/73 [01:12<01:24,  2.00s/it][' simple Some Pascal.']
 44%|███████████████████████████████████████████████████████████████████▉                 

etc...

Does anyone have an idea of what's happening? And where should I start looking to fix it?

Thank you soo much :D

Smaller models?

Unless I've missed something, it's not clear whether the same technique works to accelerate the small.en and smaller whisper models. Is that something you've looked at? If not, would there be any mileage in training it up?

small.en in particular is interesting because it's the biggest model that fits onto a Raspberry Pi Zero 2, but it isn't quite fast enough for real-time use. Speeding it up would be transformative.

How to make training data?

I have a folder like this:
audio_1
transcript_1.txt
audio_2
transcript_2.txt

How can I turn this folder into a Hugging Face dataset?
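
One possible way to do this (a sketch only, not from the repository; the file naming and the 16 kHz sampling rate are assumptions based on the folder layout above) is to build a datasets.Dataset from the file paths and transcripts, then cast the audio column:

from pathlib import Path

from datasets import Audio, Dataset

data_dir = Path("my_data")  # hypothetical folder containing audio_*.wav and transcript_*.txt

audio_paths, transcripts = [], []
for transcript_file in sorted(data_dir.glob("transcript_*.txt")):
    idx = transcript_file.stem.split("_")[-1]
    audio_paths.append(str(data_dir / f"audio_{idx}.wav"))
    transcripts.append(transcript_file.read_text().strip())

dataset = Dataset.from_dict({"audio": audio_paths, "text": transcripts})
# Decode and resample the audio column to 16 kHz, the rate Whisper expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Optionally push to the Hub so the training scripts can load it by name:
# dataset.push_to_hub("your-username/my-asr-dataset")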

ggml-distil-small.en.bin slower than ggml-small.en.bin under whisper.cpp

Using the original ggml-small.en.bin on an M1 mac, running whisper.cpp on the hp0.wav sample gives me these timings:

whisper_print_timings:     load time =   469.52 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   144.11 ms
whisper_print_timings:   sample time =  2631.84 ms /  3090 runs (    0.85 ms per run)
whisper_print_timings:   encode time =  7104.00 ms /    13 runs (  546.46 ms per run)
whisper_print_timings:   decode time =   330.89 ms /    25 runs (   13.24 ms per run)
whisper_print_timings:   batchd time =  9753.98 ms /  2982 runs (    3.27 ms per run)
whisper_print_timings:   prompt time =   944.44 ms /  2071 runs (    0.46 ms per run)
whisper_print_timings:    total time = 21415.03 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-small.en.bin -f samples/hp0.wav  7.33s user 0.96s system 38% cpu 21.465 total

Running with ggml-distil-small.en.bin gives the following:

whisper_print_timings:     load time =   333.88 ms
whisper_print_timings:     fallbacks =  13 p /   5 h
whisper_print_timings:      mel time =   155.13 ms
whisper_print_timings:   sample time =  7036.76 ms /  9308 runs (    0.76 ms per run)
whisper_print_timings:   encode time =  6004.30 ms /    11 runs (  545.85 ms per run)
whisper_print_timings:   decode time =   548.74 ms /    95 runs (    5.78 ms per run)
whisper_print_timings:   batchd time = 15798.76 ms /  9098 runs (    1.74 ms per run)
whisper_print_timings:   prompt time =   347.48 ms /  1503 runs (    0.23 ms per run)
whisper_print_timings:    total time = 30275.82 ms
ggml_metal_free: deallocating
ggml_metal_free: deallocating
./main -m models/ggml-distil-small.en.bin -f samples/hp0.wav  17.51s user 1.67s system 63% cpu 30.319 total

Is that expected? It looks like it's faster per pass through the model at every phase, but it needs dramatically more passes.

./main was built here just by calling make, so all config parameters are the defaults.

example using microphone

Let's face it: these models are developed with static datasets, but a primary use case is streaming audio transcription.

Please include a microphone-based demo (or suffer 1000 GitHub issues begging for it; see other Whisper repos).
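
In the meantime, a very minimal push-to-talk style sketch (not true streaming) could record a fixed-length clip with the sounddevice package and pass it to a pipe built as in the usage section above; the 5-second duration is an arbitrary assumption:

import numpy as np
import sounddevice as sd

sampling_rate = 16_000  # Whisper's expected sampling rate
duration_s = 5

print("Recording...")
audio = sd.rec(int(duration_s * sampling_rate), samplerate=sampling_rate, channels=1, dtype="float32")
sd.wait()  # block until the recording has finished

result = pipe({"array": np.squeeze(audio), "sampling_rate": sampling_rate})
print(result["text"])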

small uses more memory (but is faster) than medium (ONNX quantized)

Setup

CUDA 12.2
GTX 1080
Copied all ONNX quantized models and required config jsons to their required location.

Code

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from py3nvml.py3nvml import *
from transformers import WhisperTokenizerFast, WhisperFeatureExtractor, pipeline
import torch
import time
import numpy as np

# Initialize NVML
nvmlInit()
# Get the first GPU handle
handle = nvmlDeviceGetHandleByIndex(0)
# Get initial GPU memory usage
info = nvmlDeviceGetMemoryInfo(handle)
initial_gpu_memory = info.used
device = "cuda:0"

# Load models
model_name = '/home/Downloads/asr_models/distil-small-en/onnx_quantized' # Copied all needed files
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_name, export=False, use_safetensors=True)
model.to(device)
tokenizer = WhisperTokenizerFast.from_pretrained(model_name)
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)
gpu_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    max_new_tokens=128,
    device=0,
)

# Measurements.
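# Note: wav_file (path to a test clip) and wav_secs (its duration in seconds)
# are assumed to be defined elsewhere; they are not shown in this snippet.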
times = []
for _ in range(15):
    start = time.time()
    gpu_pipe(wav_file)
    rtf = (time.time() - start) / (wav_secs)
    times.append(rtf)
times.sort()
times = times[1:-1]
print(np.mean(times))

info = nvmlDeviceGetMemoryInfo(handle)
final_gpu_memory = info.used
gpu_memory_used = (final_gpu_memory - initial_gpu_memory) / 1024 / 1024
print(f"GPU memory used by the code block: {gpu_memory_used} MiB")
nvmlShutdown()

Results

Model             RTF                  Mem Used
distil-medium-en  0.5082902994641076   1878.0 MiB
distil-small-en   0.3903530106270162   2884.0 MiB

Sentence separation

Hey there,

I've noticed that distil-whisper tends to chop sentences in half. It'd be great if it could wrap sentences properly, especially after commas and periods. Switching to word mode and using spaCy sounds a bit like overkill. Any suggestions on how to fix this?

Accelerate

Accelerate is a tool for multi-machine training, but why do you use it on a single GPU?

How to use ONNX model?

Hello there,

I'm interested in using the ONNX model, as I saw that you are providing the weights for it.
I tried to use it with optimum library, but didn't manage to make it work.
Could someone indicate in which direction I should look into?

Thank you so much for this repository and the work you put into it. It really helps!!

Note:

Here is what I tried:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype,  encoder_file_name=f"encoder_model.onnx"
)

Here is the error:

RuntimeError: Too many ONNX model files were found in distil-whisper/distil-large-v2, specify which one to load by using the encoder_file_name argument.

Training without real labels

Is it possible to do the pseudo-labelling without access to already-transcribed audio?

From what I see in the training scripts, the dataset should have a text column, so it's not possible to just use a bunch of audio files to distill a Whisper model.

RuntimeError: Error(s) in loading state_dict for WhisperForConditionalGeneration.

Hi @sanchit-gandhi
I have followed the instructions for training, but during the training step I get the error below. How can I fix it?
Traceback (most recent call last):
File "/home/codelex/Documents/lawly/linear_regression/distil-whisper/training/Distil-whisper-mn/create_student_model.py", line 206, in
init_student_model_from_teacher(
File "/home/codelex/Documents/lawly/linear_regression/distil-whisper/training/Distil-whisper-mn/create_student_model.py", line 134, in init_student_model_from_teacher
raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for WhisperForConditionalGeneration.
Missing key(s) in state_dict: ['model.encoder.layers.4.self_attn.k_proj.weight', 'model.encoder.layers.4.self_attn.v_proj.weight', 'model.encoder.layers.4.self_attn.v_proj.bias', 'model.encoder.layers.4.self_attn.q_proj.weight', 'model.encoder.layers.4.self_attn.q_proj.bias', 'model.encoder.layers.4.self_attn.out_proj.weight', 'model.encoder.layers.4.self_attn.out_proj.bias', 'model.encoder.layers.4.self_attn_layer_norm.weight', 'model.encoder.layers.4.self_attn_layer_norm.bias', 'model.encoder.layers.4.fc1.weight', 'model.encoder.layers.4.fc1.bias', 'model.encoder.layers.4.fc2.weight', 'model.encoder.layers.4.fc2.bias', 'model.encoder.layers.4.final_layer_norm.weight', 'model.encoder.layers.4.final_layer_norm.bias', 'model.encoder.layers.5.self_attn.k_proj.weight', 'model.encoder.layers.5.self_attn.v_proj.weight', 'model.encoder.layers.5.self_attn.v_proj.bias', 'model.encoder.layers.5.self_attn.q_proj.weight', 'model.encoder.layers.5.self_attn.q_proj.bias', 'model.encoder.layers.5.self_attn.out_proj.weight', 'model.encoder.layers.5.self_attn.out_proj.bias', 'model.encoder.layers.5.self_attn_layer_norm.weight', 'model.encoder.layers.5.self_attn_layer_norm.bias', 'model.encoder.layers.5.fc1.weight', 'model.encoder.layers.5.fc1.bias', 'model.encoder.layers.5.fc2.weight', 'model.encoder.layers.5.fc2.bias', 'model.encoder.layers.5.final_layer_norm.weight', 'model.encoder.layers.5.final_layer_norm.bias', 'model.encoder.layers.6.self_attn.k_proj.weight', 'model.encoder.layers.6.self_attn.v_proj.weight', 'model.encoder.layers.6.self_attn.v_proj.bias', 'model.encoder.layers.6.self_attn.q_proj.weight', 'model.encoder.layers.6.self_attn.q_proj.bias', 'model.encoder.layers.6.self_attn.out_proj.weight', 'model.encoder.layers.6.self_attn.out_proj.bias', 'model.encoder.layers.6.self_attn_layer_norm.weight', 'model.encoder.layers.6.self_attn_layer_norm.bias', 'model.encoder.layers.6.fc1.weight', 'model.encoder.layers.6.fc1.bias', 'model.encoder.layers.6.fc2.weight', 'model.encoder.layers.6.fc2.bias', 'model.encoder.layers.6.final_layer_norm.weight', 'model.encoder.layers.6.final_layer_norm.bias', 'model.encoder.layers.7.self_attn.k_proj.weight', 'model.encoder.layers.7.self_attn.v_proj.weight', 'model.encoder.layers.7.self_attn.v_proj.bias', 'model.encoder.layers.7.self_attn.q_proj.weight', 'model.encoder.layers.7.self_attn.q_proj.bias', 'model.encoder.layers.7.self_attn.out_proj.weight', 'model.encoder.layers.7.self_attn.out_proj.bias', 'model.encoder.layers.7.self_attn_layer_norm.weight', 'model.encoder.layers.7.self_attn_layer_norm.bias', 'model.encoder.layers.7.fc1.weight', 'model.encoder.layers.7.fc1.bias', 'model.encoder.layers.7.fc2.weight', 'model.encoder.layers.7.fc2.bias', 'model.encoder.layers.7.final_layer_norm.weight', 'model.encoder.layers.7.final_layer_norm.bias', 'model.encoder.layers.8.self_attn.k_proj.weight', 'model.encoder.layers.8.self_attn.v_proj.weight', 'model.encoder.layers.8.self_attn.v_proj.bias', 'model.encoder.layers.8.self_attn.q_proj.weight', 'model.encoder.layers.8.self_attn.q_proj.bias', 'model.encoder.layers.8.self_attn.out_proj.weight', 'model.encoder.layers.8.self_attn.out_proj.bias', 'model.encoder.layers.8.self_attn_layer_norm.weight', 'model.encoder.layers.8.self_attn_layer_norm.bias', 'model.encoder.layers.8.fc1.weight', 'model.encoder.layers.8.fc1.bias', 'model.encoder.layers.8.fc2.weight', 'model.encoder.layers.8.fc2.bias', 'model.encoder.layers.8.final_layer_norm.weight', 'model.encoder.layers.8.final_layer_norm.bias', 
'model.encoder.layers.9.self_attn.k_proj.weight', 'model.encoder.layers.9.self_attn.v_proj.weight', 'model.encoder.layers.9.self_attn.v_proj.bias', 'model.encoder.layers.9.self_attn.q_proj.weight', 'model.encoder.layers.9.self_attn.q_proj.bias', 'model.encoder.layers.9.self_attn.out_proj.weight', 'model.encoder.layers.9.self_attn.out_proj.bias', 'model.encoder.layers.9.self_attn_layer_norm.weight', 'model.encoder.layers.9.self_attn_layer_norm.bias', 'model.encoder.layers.9.fc1.weight', 'model.encoder.layers.9.fc1.bias', 'model.encoder.layers.9.fc2.weight', 'model.encoder.layers.9.fc2.bias', 'model.encoder.layers.9.final_layer_norm.weight', 'model.encoder.layers.9.final_layer_norm.bias', 'model.encoder.layers.10.self_attn.k_proj.weight', 'model.encoder.layers.10.self_attn.v_proj.weight', 'model.encoder.layers.10.self_attn.v_proj.bias', 'model.encoder.layers.10.self_attn.q_proj.weight', 'model.encoder.layers.10.self_attn.q_proj.bias', 'model.encoder.layers.10.self_attn.out_proj.weight', 'model.encoder.layers.10.self_attn.out_proj.bias', 'model.encoder.layers.10.self_attn_layer_norm.weight', 'model.encoder.layers.10.self_attn_layer_norm.bias', 'model.encoder.layers.10.fc1.weight', 'model.encoder.layers.10.fc1.bias', 'model.encoder.layers.10.fc2.weight', 'model.encoder.layers.10.fc2.bias', 'model.encoder.layers.10.final_layer_norm.weight', 'model.encoder.layers.10.final_layer_norm.bias', 'model.encoder.layers.11.self_attn.k_proj.weight', 'model.encoder.layers.11.self_attn.v_proj.weight', 'model.encoder.layers.11.self_attn.v_proj.bias', 'model.encoder.layers.11.self_attn.q_proj.weight', 'model.encoder.layers.11.self_attn.q_proj.bias', 'model.encoder.layers.11.self_attn.out_proj.weight', 'model.encoder.layers.11.self_attn.out_proj.bias', 'model.encoder.layers.11.self_attn_layer_norm.weight', 'model.encoder.layers.11.self_attn_layer_norm.bias', 'model.encoder.layers.11.fc1.weight', 'model.encoder.layers.11.fc1.bias', 'model.encoder.layers.11.fc2.weight', 'model.encoder.layers.11.fc2.bias', 'model.encoder.layers.11.final_layer_norm.weight', 'model.encoder.layers.11.final_layer_norm.bias', 'model.encoder.layers.12.self_attn.k_proj.weight', 'model.encoder.layers.12.self_attn.v_proj.weight', 'model.encoder.layers.12.self_attn.v_proj.bias', 'model.encoder.layers.12.self_attn.q_proj.weight', 'model.encoder.layers.12.self_attn.q_proj.bias', 'model.encoder.layers.12.self_attn.out_proj.weight', 'model.encoder.layers.12.self_attn.out_proj.bias', 'model.encoder.layers.12.self_attn_layer_norm.weight', 'model.encoder.layers.12.self_attn_layer_norm.bias', 'model.encoder.layers.12.fc1.weight', 'model.encoder.layers.12.fc1.bias', 'model.encoder.layers.12.fc2.weight', 'model.encoder.layers.12.fc2.bias', 'model.encoder.layers.12.final_layer_norm.weight', 'model.encoder.layers.12.final_layer_norm.bias', 'model.encoder.layers.13.self_attn.k_proj.weight', 'model.encoder.layers.13.self_attn.v_proj.weight', 'model.encoder.layers.13.self_attn.v_proj.bias', 'model.encoder.layers.13.self_attn.q_proj.weight', 'model.encoder.layers.13.self_attn.q_proj.bias', 'model.encoder.layers.13.self_attn.out_proj.weight', 'model.encoder.layers.13.self_attn.out_proj.bias', 'model.encoder.layers.13.self_attn_layer_norm.weight', 'model.encoder.layers.13.self_attn_layer_norm.bias', 'model.encoder.layers.13.fc1.weight', 'model.encoder.layers.13.fc1.bias', 'model.encoder.layers.13.fc2.weight', 'model.encoder.layers.13.fc2.bias', 'model.encoder.layers.13.final_layer_norm.weight', 'model.encoder.layers.13.final_layer_norm.bias', 
'model.encoder.layers.14.self_attn.k_proj.weight', 'model.encoder.layers.14.self_attn.v_proj.weight', 'model.encoder.layers.14.self_attn.v_proj.bias', 'model.encoder.layers.14.self_attn.q_proj.weight', 'model.encoder.layers.14.self_attn.q_proj.bias', 'model.encoder.layers.14.self_attn.out_proj.weight', 'model.encoder.layers.14.self_attn.out_proj.bias', 'model.encoder.layers.14.self_attn_layer_norm.weight', 'model.encoder.layers.14.self_attn_layer_norm.bias', 'model.encoder.layers.14.fc1.weight', 'model.encoder.layers.14.fc1.bias', 'model.encoder.layers.14.fc2.weight', 'model.encoder.layers.14.fc2.bias', 'model.encoder.layers.14.final_layer_norm.weight', 'model.encoder.layers.14.final_layer_norm.bias', 'model.encoder.layers.15.self_attn.k_proj.weight', 'model.encoder.layers.15.self_attn.v_proj.weight', 'model.encoder.layers.15.self_attn.v_proj.bias', 'model.encoder.layers.15.self_attn.q_proj.weight', 'model.encoder.layers.15.self_attn.q_proj.bias', 'model.encoder.layers.15.self_attn.out_proj.weight', 'model.encoder.layers.15.self_attn.out_proj.bias', 'model.encoder.layers.15.self_attn_layer_norm.weight', 'model.encoder.layers.15.self_attn_layer_norm.bias', 'model.encoder.layers.15.fc1.weight', 'model.encoder.layers.15.fc1.bias', 'model.encoder.layers.15.fc2.weight', 'model.encoder.layers.15.fc2.bias', 'model.encoder.layers.15.final_layer_norm.weight', 'model.encoder.layers.15.final_layer_norm.bias', 'model.encoder.layers.16.self_attn.k_proj.weight', 'model.encoder.layers.16.self_attn.v_proj.weight', 'model.encoder.layers.16.self_attn.v_proj.bias', 'model.encoder.layers.16.self_attn.q_proj.weight', 'model.encoder.layers.16.self_attn.q_proj.bias', 'model.encoder.layers.16.self_attn.out_proj.weight', 'model.encoder.layers.16.self_attn.out_proj.bias', 'model.encoder.layers.16.self_attn_layer_norm.weight', 'model.encoder.layers.16.self_attn_layer_norm.bias', 'model.encoder.layers.16.fc1.weight', 'model.encoder.layers.16.fc1.bias', 'model.encoder.layers.16.fc2.weight', 'model.encoder.layers.16.fc2.bias', 'model.encoder.layers.16.final_layer_norm.weight', 'model.encoder.layers.16.final_layer_norm.bias', 'model.encoder.layers.17.self_attn.k_proj.weight', 'model.encoder.layers.17.self_attn.v_proj.weight', 'model.encoder.layers.17.self_attn.v_proj.bias', 'model.encoder.layers.17.self_attn.q_proj.weight', 'model.encoder.layers.17.self_attn.q_proj.bias', 'model.encoder.layers.17.self_attn.out_proj.weight', 'model.encoder.layers.17.self_attn.out_proj.bias', 'model.encoder.layers.17.self_attn_layer_norm.weight', 'model.encoder.layers.17.self_attn_layer_norm.bias', 'model.encoder.layers.17.fc1.weight', 'model.encoder.layers.17.fc1.bias', 'model.encoder.layers.17.fc2.weight', 'model.encoder.layers.17.fc2.bias', 'model.encoder.layers.17.final_layer_norm.weight', 'model.encoder.layers.17.final_layer_norm.bias', 'model.encoder.layers.18.self_attn.k_proj.weight', 'model.encoder.layers.18.self_attn.v_proj.weight', 'model.encoder.layers.18.self_attn.v_proj.bias', 'model.encoder.layers.18.self_attn.q_proj.weight', 'model.encoder.layers.18.self_attn.q_proj.bias', 'model.encoder.layers.18.self_attn.out_proj.weight', 'model.encoder.layers.18.self_attn.out_proj.bias', 'model.encoder.layers.18.self_attn_layer_norm.weight', 'model.encoder.layers.18.self_attn_layer_norm.bias', 'model.encoder.layers.18.fc1.weight', 'model.encoder.layers.18.fc1.bias', 'model.encoder.layers.18.fc2.weight', 'model.encoder.layers.18.fc2.bias', 'model.encoder.layers.18.final_layer_norm.weight', 'model.encoder.layers.18.final_layer_norm.bias', 
'model.encoder.layers.19.self_attn.k_proj.weight', 'model.encoder.layers.19.self_attn.v_proj.weight', 'model.encoder.layers.19.self_attn.v_proj.bias', 'model.encoder.layers.19.self_attn.q_proj.weight', 'model.encoder.layers.19.self_attn.q_proj.bias', 'model.encoder.layers.19.self_attn.out_proj.weight', 'model.encoder.layers.19.self_attn.out_proj.bias', 'model.encoder.layers.19.self_attn_layer_norm.weight', 'model.encoder.layers.19.self_attn_layer_norm.bias', 'model.encoder.layers.19.fc1.weight', 'model.encoder.layers.19.fc1.bias', 'model.encoder.layers.19.fc2.weight', 'model.encoder.layers.19.fc2.bias', 'model.encoder.layers.19.final_layer_norm.weight', 'model.encoder.layers.19.final_layer_norm.bias', 'model.encoder.layers.20.self_attn.k_proj.weight', 'model.encoder.layers.20.self_attn.v_proj.weight', 'model.encoder.layers.20.self_attn.v_proj.bias', 'model.encoder.layers.20.self_attn.q_proj.weight', 'model.encoder.layers.20.self_attn.q_proj.bias', 'model.encoder.layers.20.self_attn.out_proj.weight', 'model.encoder.layers.20.self_attn.out_proj.bias', 'model.encoder.layers.20.self_attn_layer_norm.weight', 'model.encoder.layers.20.self_attn_layer_norm.bias', 'model.encoder.layers.20.fc1.weight', 'model.encoder.layers.20.fc1.bias', 'model.encoder.layers.20.fc2.weight', 'model.encoder.layers.20.fc2.bias', 'model.encoder.layers.20.final_layer_norm.weight', 'model.encoder.layers.20.final_layer_norm.bias', 'model.encoder.layers.21.self_attn.k_proj.weight', 'model.encoder.layers.21.self_attn.v_proj.weight', 'model.encoder.layers.21.self_attn.v_proj.bias', 'model.encoder.layers.21.self_attn.q_proj.weight', 'model.encoder.layers.21.self_attn.q_proj.bias', 'model.encoder.layers.21.self_attn.out_proj.weight', 'model.encoder.layers.21.self_attn.out_proj.bias', 'model.encoder.layers.21.self_attn_layer_norm.weight', 'model.encoder.layers.21.self_attn_layer_norm.bias', 'model.encoder.layers.21.fc1.weight', 'model.encoder.layers.21.fc1.bias', 'model.encoder.layers.21.fc2.weight', 'model.encoder.layers.21.fc2.bias', 'model.encoder.layers.21.final_layer_norm.weight', 'model.encoder.layers.21.final_layer_norm.bias', 'model.encoder.layers.22.self_attn.k_proj.weight', 'model.encoder.layers.22.self_attn.v_proj.weight', 'model.encoder.layers.22.self_attn.v_proj.bias', 'model.encoder.layers.22.self_attn.q_proj.weight', 'model.encoder.layers.22.self_attn.q_proj.bias', 'model.encoder.layers.22.self_attn.out_proj.weight', 'model.encoder.layers.22.self_attn.out_proj.bias', 'model.encoder.layers.22.self_attn_layer_norm.weight', 'model.encoder.layers.22.self_attn_layer_norm.bias', 'model.encoder.layers.22.fc1.weight', 'model.encoder.layers.22.fc1.bias', 'model.encoder.layers.22.fc2.weight', 'model.encoder.layers.22.fc2.bias', 'model.encoder.layers.22.final_layer_norm.weight', 'model.encoder.layers.22.final_layer_norm.bias', 'model.encoder.layers.23.self_attn.k_proj.weight', 'model.encoder.layers.23.self_attn.v_proj.weight', 'model.encoder.layers.23.self_attn.v_proj.bias', 'model.encoder.layers.23.self_attn.q_proj.weight', 'model.encoder.layers.23.self_attn.q_proj.bias', 'model.encoder.layers.23.self_attn.out_proj.weight', 'model.encoder.layers.23.self_attn.out_proj.bias', 'model.encoder.layers.23.self_attn_layer_norm.weight', 'model.encoder.layers.23.self_attn_layer_norm.bias', 'model.encoder.layers.23.fc1.weight', 'model.encoder.layers.23.fc1.bias', 'model.encoder.layers.23.fc2.weight', 'model.encoder.layers.23.fc2.bias', 'model.encoder.layers.23.final_layer_norm.weight', 'model.encoder.layers.23.final_layer_norm.bias', 
'model.encoder.layers.24.self_attn.k_proj.weight', 'model.encoder.layers.24.self_attn.v_proj.weight', 'model.encoder.layers.24.self_attn.v_proj.bias', 'model.encoder.layers.24.self_attn.q_proj.weight', 'model.encoder.layers.24.self_attn.q_proj.bias', 'model.encoder.layers.24.self_attn.out_proj.weight', 'model.encoder.layers.24.self_attn.out_proj.bias', 'model.encoder.layers.24.self_attn_layer_norm.weight', 'model.encoder.layers.24.self_attn_layer_norm.bias', 'model.encoder.layers.24.fc1.weight', 'model.encoder.layers.24.fc1.bias', 'model.encoder.layers.24.fc2.weight', 'model.encoder.layers.24.fc2.bias', 'model.encoder.layers.24.final_layer_norm.weight', 'model.encoder.layers.24.final_layer_norm.bias', 'model.encoder.layers.25.self_attn.k_proj.weight', 'model.encoder.layers.25.self_attn.v_proj.weight', 'model.encoder.layers.25.self_attn.v_proj.bias', 'model.encoder.layers.25.self_attn.q_proj.weight', 'model.encoder.layers.25.self_attn.q_proj.bias', 'model.encoder.layers.25.self_attn.out_proj.weight', 'model.encoder.layers.25.self_attn.out_proj.bias', 'model.encoder.layers.25.self_attn_layer_norm.weight', 'model.encoder.layers.25.self_attn_layer_norm.bias', 'model.encoder.layers.25.fc1.weight', 'model.encoder.layers.25.fc1.bias', 'model.encoder.layers.25.fc2.weight', 'model.encoder.layers.25.fc2.bias', 'model.encoder.layers.25.final_layer_norm.weight', 'model.encoder.layers.25.final_layer_norm.bias', 'model.encoder.layers.26.self_attn.k_proj.weight', 'model.encoder.layers.26.self_attn.v_proj.weight', 'model.encoder.layers.26.self_attn.v_proj.bias', 'model.encoder.layers.26.self_attn.q_proj.weight', 'model.encoder.layers.26.self_attn.q_proj.bias', 'model.encoder.layers.26.self_attn.out_proj.weight', 'model.encoder.layers.26.self_attn.out_proj.bias', 'model.encoder.layers.26.self_attn_layer_norm.weight', 'model.encoder.layers.26.self_attn_layer_norm.bias', 'model.encoder.layers.26.fc1.weight', 'model.encoder.layers.26.fc1.bias', 'model.encoder.layers.26.fc2.weight', 'model.encoder.layers.26.fc2.bias', 'model.encoder.layers.26.final_layer_norm.weight', 'model.encoder.layers.26.final_layer_norm.bias', 'model.encoder.layers.27.self_attn.k_proj.weight', 'model.encoder.layers.27.self_attn.v_proj.weight', 'model.encoder.layers.27.self_attn.v_proj.bias', 'model.encoder.layers.27.self_attn.q_proj.weight', 'model.encoder.layers.27.self_attn.q_proj.bias', 'model.encoder.layers.27.self_attn.out_proj.weight', 'model.encoder.layers.27.self_attn.out_proj.bias', 'model.encoder.layers.27.self_attn_layer_norm.weight', 'model.encoder.layers.27.self_attn_layer_norm.bias', 'model.encoder.layers.27.fc1.weight', 'model.encoder.layers.27.fc1.bias', 'model.encoder.layers.27.fc2.weight', 'model.encoder.layers.27.fc2.bias', 'model.encoder.layers.27.final_layer_norm.weight', 'model.encoder.layers.27.final_layer_norm.bias', 'model.encoder.layers.28.self_attn.k_proj.weight', 'model.encoder.layers.28.self_attn.v_proj.weight', 'model.encoder.layers.28.self_attn.v_proj.bias', 'model.encoder.layers.28.self_attn.q_proj.weight', 'model.encoder.layers.28.self_attn.q_proj.bias', 'model.encoder.layers.28.self_attn.out_proj.weight', 'model.encoder.layers.28.self_attn.out_proj.bias', 'model.encoder.layers.28.self_attn_layer_norm.weight', 'model.encoder.layers.28.self_attn_layer_norm.bias', 'model.encoder.layers.28.fc1.weight', 'model.encoder.layers.28.fc1.bias', 'model.encoder.layers.28.fc2.weight', 'model.encoder.layers.28.fc2.bias', 'model.encoder.layers.28.final_layer_norm.weight', 'model.encoder.layers.28.final_layer_norm.bias', 
'model.encoder.layers.29.self_attn.k_proj.weight', 'model.encoder.layers.29.self_attn.v_proj.weight', 'model.encoder.layers.29.self_attn.v_proj.bias', 'model.encoder.layers.29.self_attn.q_proj.weight', 'model.encoder.layers.29.self_attn.q_proj.bias', 'model.encoder.layers.29.self_attn.out_proj.weight', 'model.encoder.layers.29.self_attn.out_proj.bias', 'model.encoder.layers.29.self_attn_layer_norm.weight', 'model.encoder.layers.29.self_attn_layer_norm.bias', 'model.encoder.layers.29.fc1.weight', 'model.encoder.layers.29.fc1.bias', 'model.encoder.layers.29.fc2.weight', 'model.encoder.layers.29.fc2.bias', 'model.encoder.layers.29.final_layer_norm.weight', 'model.encoder.layers.29.final_layer_norm.bias', 'model.encoder.layers.30.self_attn.k_proj.weight', 'model.encoder.layers.30.self_attn.v_proj.weight', 'model.encoder.layers.30.self_attn.v_proj.bias', 'model.encoder.layers.30.self_attn.q_proj.weight', 'model.encoder.layers.30.self_attn.q_proj.bias', 'model.encoder.layers.30.self_attn.out_proj.weight', 'model.encoder.layers.30.self_attn.out_proj.bias', 'model.encoder.layers.30.self_attn_layer_norm.weight', 'model.encoder.layers.30.self_attn_layer_norm.bias', 'model.encoder.layers.30.fc1.weight', 'model.encoder.layers.30.fc1.bias', 'model.encoder.layers.30.fc2.weight', 'model.encoder.layers.30.fc2.bias', 'model.encoder.layers.30.final_layer_norm.weight', 'model.encoder.layers.30.final_layer_norm.bias', 'model.encoder.layers.31.self_attn.k_proj.weight', 'model.encoder.layers.31.self_attn.v_proj.weight', 'model.encoder.layers.31.self_attn.v_proj.bias', 'model.encoder.layers.31.self_attn.q_proj.weight', 'model.encoder.layers.31.self_attn.q_proj.bias', 'model.encoder.layers.31.self_attn.out_proj.weight', 'model.encoder.layers.31.self_attn.out_proj.bias', 'model.encoder.layers.31.self_attn_layer_norm.weight', 'model.encoder.layers.31.self_attn_layer_norm.bias', 'model.encoder.layers.31.fc1.weight', 'model.encoder.layers.31.fc1.bias', 'model.encoder.layers.31.fc2.weight', 'model.encoder.layers.31.fc2.bias', 'model.encoder.layers.31.final_layer_norm.weight', 'model.encoder.layers.31.final_layer_norm.bias']

[Question] Mixed Speech transcription

Is it possible to fine-tune Whisper/Distil-Whisper for mixed-speech (code-switched) transcription, like Hindi+English within a single sentence, which is common in casual conversation? Has anyone tried this before? Would training on a mixture of Hindi and English datasets work?
I recently used a fine-tuned Whisper model for ASR, and it ended up hallucinating and adding extra text, which I haven't been able to fix yet.
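One way to build such a training mixture (not something that has been validated here) is to interleave a Hindi corpus with an English one using 🤗 Datasets. A minimal sketch, where the choice of Common Voice 13 and the 50/50 mixing ratio are arbitrary assumptions:

from datasets import Audio, interleave_datasets, load_dataset

# Hypothetical corpora: Common Voice 13 Hindi and English train splits, streamed to avoid a full download.
hindi = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train", streaming=True)
english = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="train", streaming=True)

# Resample both to 16 kHz, the sampling rate the Whisper feature extractor expects.
hindi = hindi.cast_column("audio", Audio(sampling_rate=16_000))
english = english.cast_column("audio", Audio(sampling_rate=16_000))

# Alternate between the two languages so every training batch contains both.
mixed = interleave_datasets([hindi, english], probabilities=[0.5, 0.5], seed=42)

Note that this only mixes monolingual sentences; genuinely code-switched speech (Hindi and English inside one sentence) would most likely need a corpus that actually contains code-switching.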

distil-whisper doesn't work as a drop-in replacement for whisper

If one of the goals of distil-whisper is to be a drop-in replacement for the Whisper models, it would be interesting to be able to cast it to an object of type whisper.Whisper, so that it could be used with any custom decoder implemented for Whisper.

Practical issue

I faced a few problems when trying to use the model with stable_whisper.

from transformers import WhisperForConditionalGeneration
import whisper
import stable_whisper
pt = WhisperForConditionalGeneration.from_pretrained('distil-whisper/distil-medium.en')
stable_whisper.modify_model(pt.model)
audio = whisper.load_audio('sample.mp3')
pt.model.transcribe(audio)

The first issue is that the Transformers model doesn't have the dims and is_multilingual properties, so I set them manually:

pt.model.dims = whisper.load_model('large-v2').dims
pt.model.is_multilingual = False

Even with those set, transcription fails with AttributeError: 'BaseModelOutput' object has no attribute 'dtype'.

Next, I tried loading the state_dict into an openai-whisper model, but that doesn't work either:

	Missing key(s) in state_dict: "encoder.positional_embedding", "encoder.blocks.0.attn.query.weight", "encoder.blocks.0.attn.query.bias", "encoder.blocks.0.attn.key.weight", ...
	Unexpected key(s) in state_dict: "encoder.embed_positions.weight", "encoder.layers.0.self_attn.k_proj.weight", "encoder.layers.0.self_attn.v_proj.weight", ...

In summary

What would it take to cast the distilled model to a whisper.Whisper so that it can be a drop-in alternative for a broader set of applications?
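For anyone who wants to experiment, the missing/unexpected key lists above suggest the two formats mostly differ in parameter naming (plus the distilled checkpoints having only two decoder layers). Below is an untested sketch of renaming the Hugging Face keys into the openai-whisper layout; the substitutions are assumptions inferred from those key names, not an official conversion:

import re

from transformers import WhisperForConditionalGeneration

# Substitutions inferred from the key names in the error output above (assumptions, applied in order).
RENAMES = [
    (r"^model\.", ""),                                     # drop the HF "model." prefix
    (r"\.layers\.", ".blocks."),                           # per-layer containers
    (r"\.self_attn\.q_proj\.", ".attn.query."),
    (r"\.self_attn\.k_proj\.", ".attn.key."),
    (r"\.self_attn\.v_proj\.", ".attn.value."),
    (r"\.self_attn\.out_proj\.", ".attn.out."),
    (r"\.self_attn_layer_norm\.", ".attn_ln."),
    (r"\.encoder_attn\.q_proj\.", ".cross_attn.query."),
    (r"\.encoder_attn\.k_proj\.", ".cross_attn.key."),
    (r"\.encoder_attn\.v_proj\.", ".cross_attn.value."),
    (r"\.encoder_attn\.out_proj\.", ".cross_attn.out."),
    (r"\.encoder_attn_layer_norm\.", ".cross_attn_ln."),
    (r"\.fc1\.", ".mlp.0."),
    (r"\.fc2\.", ".mlp.2."),
    (r"\.final_layer_norm\.", ".mlp_ln."),
    (r"^encoder\.embed_positions\.weight$", "encoder.positional_embedding"),
    (r"^decoder\.embed_positions\.weight$", "decoder.positional_embedding"),
    (r"^decoder\.embed_tokens\.", "decoder.token_embedding."),
    (r"^encoder\.layer_norm\.", "encoder.ln_post."),
    (r"^decoder\.layer_norm\.", "decoder.ln."),
]

def to_openai_keys(hf_state_dict):
    """Best-effort rename of HF Whisper parameter names into the openai-whisper layout."""
    renamed = {}
    for key, tensor in hf_state_dict.items():
        if key == "proj_out.weight":
            continue  # the output projection is tied to the token embedding in openai-whisper
        for pattern, repl in RENAMES:
            key = re.sub(pattern, repl, key)
        renamed[key] = tensor
    return renamed

hf_model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-medium.en")
state_dict = to_openai_keys(hf_model.state_dict())
# Note: the target whisper.ModelDimensions must also reflect the distilled architecture
# (e.g. n_text_layer=2), so reusing the dims of an official checkpoint, as done above, will not line up.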

Compatibility with CTranslate2 / faster-whisper

Great work!

I was wondering whether the distilled version might still be compatible with CTranslate2 / faster-whisper? I understand the changes to the decoder might require some work there, not to mention speculative decoding.

Thanks,
Ewald

[Speculative Decoding] How to run speculative decoding for batch_size > 1?

Transformers 4.35 only supports speculative decoding for batch size == 1. In order to use speculative decoding for batch size > 1, please make sure to use this branch: huggingface/transformers#26875

To do so, you need to install transformers as follows:

pip install git+https://github.com/huggingface/transformers.git@assistant_decoding_batch

and then you can run:

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    chunk_length_s=15,
    batch_size=4,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "default", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

The PR will be merged into Transformers soon.

Note: given the "speculative" nature of assistant decoding (a.k.a. speculative decoding), it is not recommended to use it with batch sizes higher than 4, as this can actually make the transcription pipeline slower than just using the teacher model.
See Table 22 of the paper.

Tiny model?

Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would go, as I'd like to use it on phones.

Can it be fine-tuned for a phoneme task?

Has anyone experimented with fine-tuning for phoneme recognition (English)? If so, please share some of your experiments.
Many thanks!

English only for large-v2?

Wondering if the statement in the README is correct: "drop-in replacement for Whisper on English speech recognition". Does this mean that even the large-v2-based model is English-only? Thanks!

High inference time when using chunk size 15

Hi @sanchit-gandhi !

I'm in the process of integrating multiple whisper backends into a unified package that includes VAD-based chunking. During testing, I observed significantly higher inference times while using the HuggingFace pipeline with distil-whisper. You can find the details here: https://github.com/shashikg/WhisperS2T/releases/tag/v1.1.0 [A30 GPU]

Could you please review the benchmarking script I'm using? It's available at: https://github.com/shashikg/WhisperS2T/blob/main/scripts/benchmark_huggingface_distil.py

Thanks for your assistance!

Shashi

Beam size > 1

How should decoding be handled when the beam size is greater than 1?
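With the Transformers code path, beam search can be enabled by passing num_beams through generate_kwargs. A minimal sketch reusing the model, processor, torch_dtype, device, and sample objects from the short-form example above (num_beams=5 is an arbitrary choice):

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"num_beams": 5},  # beam search instead of greedy decoding
)

result = pipe(sample)
print(result["text"])

Note that assisted (speculative) decoding in Transformers currently requires num_beams == 1, so beam search cannot be combined with an assistant model.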

Long-Form transcription with Faster Whisper

Hi, I have been working with faster-whisper and trying to use the distil-whisper model. However, distil-whisper works on 30-second audio chunks, and using it with faster-whisper only outputs the first 30 seconds.

How can it be used with the faster-whisper implementation?
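For reference, the distil-large-v3 checkpoint was trained with the sequential long-form algorithm used by OpenAI Whisper and faster-whisper in mind, so it is the distilled checkpoint to reach for here. A minimal sketch, assuming a faster-whisper version recent enough to resolve the distil-large-v3 model name (otherwise a local path to a CTranslate2 conversion of the checkpoint can be passed instead):

from faster_whisper import WhisperModel

# Load the distilled checkpoint (the model name resolution is an assumption; pass a local
# CTranslate2 conversion of the checkpoint if your faster-whisper version does not know it).
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# condition_on_previous_text=False is recommended for the distilled checkpoints.
segments, info = model.transcribe("long_audio.mp3", language="en", condition_on_previous_text=False)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

distil-large-v2 was not trained for this sequential algorithm, which is likely why earlier checkpoints struggle past the first 30-second window.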

Export subtitle formats such as VTT

whisper supports outputting multiple file formats (currently txt, vtt, srt, tsv, json, all). Does this project have the ability to output multiple formats? Here is the whisper command-line help; you can specify the output format with --output_format (a do-it-yourself VTT sketch follows the help text below).

whisper --help
usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}]
               [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
               [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE] [--task {transcribe,translate}]
               [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patience PATIENCE]
               [--length_penalty LENGTH_PENALTY] [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT]
               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16]
               [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
               [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD] [--logprob_threshold LOGPROB_THRESHOLD]
               [--no_speech_threshold NO_SPEECH_THRESHOLD] [--word_timestamps WORD_TIMESTAMPS]
               [--prepend_punctuations PREPEND_PUNCTUATIONS] [--append_punctuations APPEND_PUNCTUATIONS]
               [--highlight_words HIGHLIGHT_WORDS] [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
               [--threads THREADS]
               audio [audio ...]
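There is no built-in subtitle exporter in this project, but the Transformers pipeline can return segment-level timestamps, which are straightforward to format as VTT yourself. A minimal sketch reusing the pipe and sample objects from earlier; the to_vtt_timestamp helper is hypothetical, not part of any library:

result = pipe(sample, return_timestamps=True)

def to_vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

with open("transcript.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        end = end if end is not None else start  # the final chunk can have an open-ended timestamp
        f.write(f"{to_vtt_timestamp(start)} --> {to_vtt_timestamp(end)}\n")
        f.write(chunk["text"].strip() + "\n\n")

The same loop can emit SRT by numbering the cues and switching the millisecond separator to a comma.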

Missing config.json file

I am trying to run the following code:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)

But I am getting the following error:

OSError: distil-whisper/distil-large-v2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/distil-whisper/distil-large-v2/main' for available files.

Am I doing something wrong?

By the way, the distil-whisper instructions say that Distil-Whisper is supported in Hugging Face Transformers from version 4.35 onwards, but I wasn't able to find that version on PyPI: https://pypi.org/project/transformers/#history

[For your information] Run onnx models of distil-whisper with sherpa-onnx

FYI: we have added support for exporting distil-whisper to ONNX and running it with sherpa-onnx.

You can find a Colab notebook for illustration (linked via the "Open In Colab" badge).

sherpa-onnx is implemented in C++ and provides APIs for various languages, e.g., Python, C#, Go, C, Kotlin, and Swift.
It supports Windows/Linux/macOS and Android/iOS/Raspberry Pi, etc.

The current medium model is still very large and its RTF is greater than 1 on CPU.

Hope that tiny/base/small models will be available soon.

Streaming Implementation of Distil Whisper

Is it possible to use Distil Whisper for streaming applications?

Can I send small chunks of audio to a server for processing?

Use case: I want to use my mic to send small chunks of audio to a server and have them processed quickly so I get the transcripts back.
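Distil-Whisper is not a streaming model, but short chunks can be transcribed as they arrive: the Transformers pipeline accepts raw arrays together with their sampling rate. A minimal server-side sketch reusing the pipe object from earlier, assuming mono float32 chunks at 16 kHz arrive from the client (the transport layer is left out):

import numpy as np

def transcribe_chunk(chunk: np.ndarray) -> str:
    # `chunk` is assumed to be mono float32 PCM at 16 kHz, a few seconds long.
    result = pipe({"raw": chunk, "sampling_rate": 16_000})
    return result["text"]

# Stand-in for a chunk of microphone audio: five seconds of silence.
dummy_chunk = np.zeros(5 * 16_000, dtype=np.float32)
print(transcribe_chunk(dummy_chunk))

Transcribing fixed chunks independently loses context at the boundaries, so real low-latency streaming usually adds overlapping windows or a VAD to cut at pauses.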
