Unable to reproduce results from the paper (distil-whisper, closed)

MLMonkATGY commented on September 25, 2024

Unable to reproduce results from the paper

from distil-whisper.

Comments (6)

bryanyzhu commented on September 25, 2024

sanchit-gandhi commented on September 25, 2024

Hey @MLMonkATGY! Could you share the arguments you're passing to run_eval.py so that I can reproduce locally? I believe the discrepancy arises because the PyTorch script run_eval.py uses the BasicTextNormalizer whenever a language is set:

distil-whisper/training/run_eval.py

Lines 545 to 548 in 3490d8e

    
    normalizer = (
        BasicTextNormalizer() if data_args.language is not None
        else EnglishTextNormalizer(processor.tokenizer.english_spelling_normalizer)
    )

Whereas in the original Flax scripts, we always used the EnglishTextNormalizer:

distil-whisper/training/flax/run_eval.py

Line 728 in 3490d8e

normalizer = EnglishTextNormalizer(tokenizer.english_spelling_normalizer)

You should be able to reproduce the results one-to-one if you use the Flax script. I'll also update the PyTorch script to use the EnglishTextNormalizer if the language used is English!
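To illustrate why the normalizer choice alone can move the WER, here is a minimal sketch. The two normalizers below are toy stand-ins written for this example, not the actual BasicTextNormalizer or EnglishTextNormalizer, and the CONTRACTIONS and SPELLINGS tables are hypothetical. The point is only that an English-aware normalizer maps spelling variants and contractions to a common form before scoring, while a basic one does not.

```python
import re

# Hypothetical, tiny mapping tables used only for this illustration.
CONTRACTIONS = {"won't": "will not", "can't": "can not"}
SPELLINGS = {"colour": "color", "favourite": "favorite"}


def basic_normalize(text: str) -> str:
    """Toy language-agnostic normalizer: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def english_normalize(text: str) -> str:
    """Toy English-aware normalizer: expand contractions, then unify spellings."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    words = basic_normalize(text).split()
    return " ".join(SPELLINGS.get(word, word) for word in words)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "My favourite colour won't fade."
hypothesis = "my favorite color will not fade"

basic_wer = wer(basic_normalize(reference), basic_normalize(hypothesis))
english_wer = wer(english_normalize(reference), english_normalize(hypothesis))
print(f"basic normalizer WER:   {basic_wer:.3f}")
print(f"english normalizer WER: {english_wer:.3f}")
```

With the same transcription, the basic normalizer counts the spelling and contraction differences as errors, while the English-aware one scores them as matches.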

MLMonkATGY commented on September 25, 2024

I used the following arguments for run_eval.py.

python run_eval.py \
    --model_name_or_path "distil-whisper/distil-large-v2" \
    --dataset_name distil-whisper/common_voice_13_0 \
    --dataset_config_name en \
    --dataset_split_name test \
    --text_column_name text \
    --batch_size 128 \
    --dtype "bfloat16" \
    --generation_max_length 256 \
    --language "en" \
    --attn_implementation "flash_attention_2" \
    --streaming True

sanchit-gandhi commented on September 25, 2024

Hey @MLMonkATGY, after merging #132, I evaluated the model with the following:

#!/bin/bash

python run_eval.py \
    --model_name_or_path "distil-whisper/distil-large-v2" \
    --dataset_name "distil-whisper/common_voice_13_0" \
    --dataset_config_name "en" \
    --dataset_split_name "test" \
    --text_column_name "text" \
    --batch_size 128 \
    --dtype "bfloat16" \
    --generation_max_length 256 \
    --language "en" \
    --streaming True

And got a WER of 13.0%: https://wandb.ai/sanchit-gandhi/distil-whisper-speed-benchmark/runs/7qihyqbx?nw=nwusersanchitgandhi

This is within 0.1% absolute of the 12.9% WER reported in the paper. This 0.1% difference is expected, since the paper's WER results were computed in Flax on TPU, whereas the run_eval.py script runs in PyTorch on GPU. Matrix multiplications are implemented differently on the two stacks, which gives a subtle difference in results. Note that all WER results from the paper are in Flax, so the comparison between large-v2 and distil-large-v2 is valid. Note also that all RTF values in the paper were computed in PyTorch on GPU, so that they're most applicable to downstream use cases. I hope that helps!
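As a generic illustration of why two correct implementations can disagree slightly, floating-point addition is not associative, so the order in which partial sums are accumulated changes the last bits of the result. The sketch below uses plain Python floats and is not the actual Flax or PyTorch kernels; it only shows the underlying effect, which becomes coarser in reduced-precision dtypes such as bfloat16.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two accumulation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)      # the two sums are not bit-identical
print(abs(left_to_right - right_to_left))  # but the gap is tiny
print(math.isclose(left_to_right, right_to_left, rel_tol=1e-12))
```

Across the many accumulations in a model's forward pass, such rounding differences can occasionally flip a predicted token, which is consistent with a small WER gap between backends.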

sanchit-gandhi commented on September 25, 2024

All in all, the PR #132 should now mean that evaluating models in English with the PyTorch script run_eval.py gives WER results that are within 0.1% of the WER results quoted in the paper (using the Flax script flax/run_eval.py).

MLMonkATGY commented on September 25, 2024

Thanks!
