To pseudo-label the three open-source datasets I had to re-order the <a href="https://

Training README datasets table: text column and id column about distil-whisper HOT 4 CLOSED

guynich commented on May 26, 2024

Training README datasets table: text column and id column

from distil-whisper.

Comments (4)

guynich commented on May 26, 2024

For VoxPopuli I find raw_text to contain empty strings and think that is why my pseudo-labelling script failed after hours of compute with a ValueError("one or more references are empty strings").

For facebook/voxpopuli train split I find 5,463 empty "raw_text" strings of the 182,482 examples. Each empty "raw_text" string has a corresponding non-empty "normalized_text" string.
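
A quick count like the one above can be run before committing hours of compute. The following is a minimal sketch in plain Python: the helper name `count_empty` and the toy rows are hypothetical, only the column names "raw_text" and "normalized_text" come from the dataset, and treating whitespace-only strings as empty is an assumption.

```python
from typing import Iterable, Mapping


def count_empty(rows: Iterable[Mapping[str, str]], column: str) -> int:
    """Count rows whose `column` is empty or whitespace-only (assumption)."""
    return sum(1 for row in rows if not row[column].strip())


# Toy rows standing in for VoxPopuli examples (hypothetical data).
rows = [
    {"raw_text": "Thank you, Mr President.", "normalized_text": "thank you mr president"},
    {"raw_text": "", "normalized_text": "the debate is closed"},
    {"raw_text": "We vote tomorrow.", "normalized_text": "we vote tomorrow"},
]

empty_raw = count_empty(rows, "raw_text")
empty_normalized = count_empty(rows, "normalized_text")
print(f"{empty_raw} of {len(rows)} rows have an empty raw_text")
print(f"{empty_normalized} of {len(rows)} rows have an empty normalized_text")
```

If the second count is zero while the first is not, "normalized_text" is the safer choice of reference column.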


guynich commented on May 26, 2024

Copied from #98
When pseudo-labelling the VoxPopuli dataset, the "raw_text" column (needed for the --text_column_name option) may be an empty string for some examples; see the HF dataset card here for an empty "raw_text" example.

Question: how do I check which text column ("raw_text" or "normalized_text") was used when creating the pseudo-labelled datasets on HF, such as https://huggingface.co/datasets/distil-whisper/voxpopuli?


sanchit-gandhi commented on May 26, 2024

Hey @guynich - the provided transcriptions in the original VoxPopuli dataset are only used for computing the WER in the pseudo-labelling and distillation scripts. Since the WER is computed on normalised transcriptions, you can safely use the "normalized_text" column in the dataset, which is what was done for the Distil-Whisper datasets.
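
To illustrate why the choice of column matters little once the WER is computed on normalised text, here is a rough sketch. Both `simple_normalize` and `word_error_rate` are hypothetical stand-ins written for this example: the normaliser only lowercases and strips punctuation (it is not the Whisper normaliser), and the WER is a plain word-level edit distance rather than the metric implementation used by the training scripts.

```python
import re


def simple_normalize(text: str) -> str:
    """Lowercase and strip punctuation (a stand-in, not the Whisper normaliser)."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # An empty reference makes the ratio undefined.
        raise ValueError("reference is empty")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Toy example (hypothetical data): the same utterance in both columns.
raw_reference = "Thank you, Mr President."
normalized_reference = "thank you mr president"
prediction = "Thank you Mr. President"

wer_raw = word_error_rate(simple_normalize(raw_reference), simple_normalize(prediction))
wer_norm = word_error_rate(simple_normalize(normalized_reference), simple_normalize(prediction))
print(wer_raw, wer_norm)
```

For this toy sentence both references give the same score after normalisation. An empty raw reference, by contrast, leaves nothing to divide by, which is why it has to be either avoided or filtered out.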

If you do decide to use the un-normalised (raw) text column, you should filter out any empty transcriptions from your dataset using a raw_datasets.filter method, e.g. as done here:

distil-whisper/training/run_distillation.py

Lines 1224 to 1236 in b948d02

    
    # 10.6: Filter training data with labels longer than `max_label_length`
    def is_labels_in_length_range(labels):
        return 0 < len(labels) <= max_label_length

    filter_by_labels_fn = partial(
        vectorized_datasets.filter, function=is_labels_in_length_range, input_columns=["labels"]
    )
    if accelerator.is_main_process:
        vectorized_datasets = (
            filter_by_labels_fn(num_proc=num_workers, desc="filtering train dataset")
            if not data_args.streaming
            else filter_by_labels_fn()
        )
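
The same pattern can be adapted to drop empty raw transcriptions. What follows is a minimal sketch in plain Python rather than the script's own code: the names `is_in_length_range`, `MAX_LABEL_LENGTH` and the toy rows are hypothetical, and length is measured in whitespace-separated words here, whereas the excerpt above measures the length of the tokenised labels.

```python
from functools import partial

MAX_LABEL_LENGTH = 8  # hypothetical limit, kept small for the toy data


def is_in_length_range(text: str, max_length: int) -> bool:
    """Keep non-empty transcriptions of at most `max_length` words."""
    return 0 < len(text.split()) <= max_length


# Toy rows standing in for dataset examples (hypothetical data).
rows = [
    {"raw_text": "Thank you, Mr President."},
    {"raw_text": ""},
    {"raw_text": "   "},
    {"raw_text": "one two three four five six seven eight nine"},
    {"raw_text": "We vote tomorrow."},
]

# Bind the limit once so the predicate takes a single argument.
keep = partial(is_in_length_range, max_length=MAX_LABEL_LENGTH)
filtered = [row for row in rows if keep(row["raw_text"])]
print(f"kept {len(filtered)} of {len(rows)} rows")
```

With a Hugging Face `datasets` object, a single-argument predicate like `keep` could presumably be passed to the `filter` method with `input_columns=["raw_text"]`, mirroring the `input_columns=["labels"]` call in the excerpt.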


guynich commented on May 26, 2024

Thank you for the helpful comment and for the fix #102. Closing.


