
Comments (4)

guynich commented on May 26, 2024

For VoxPopuli I find that "raw_text" contains empty strings, and I think that is why my pseudo-labelling script failed after hours of compute with a ValueError("one or more references are empty strings").

In the facebook/voxpopuli train split I find 5,463 empty "raw_text" strings among the 182,482 examples. Each empty "raw_text" string has a corresponding non-empty "normalized_text" string.
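A count like the one above could be reproduced with a small helper along these lines. This is only a sketch: the function name `count_empty_raw_text` and the toy rows are hypothetical, and it assumes each example is a dict with string-valued "raw_text" and "normalized_text" fields. It also treats whitespace-only strings as empty, which is an assumption rather than something checked on the real split.

```python
from typing import Dict, Iterable, Tuple


def count_empty_raw_text(examples: Iterable[Dict[str, str]]) -> Tuple[int, int, int]:
    """Return (total examples, empty raw_text, empty raw_text with non-empty normalized_text)."""
    total = empty_raw = recoverable = 0
    for example in examples:
        total += 1
        # Assumption: whitespace-only strings are counted as empty.
        if not example["raw_text"].strip():
            empty_raw += 1
            if example["normalized_text"].strip():
                recoverable += 1
    return total, empty_raw, recoverable


# Toy rows standing in for the real split (illustrative data only).
toy_rows = [
    {"raw_text": "Thank you, Mr President.", "normalized_text": "thank you mr president"},
    {"raw_text": "", "normalized_text": "the debate is closed"},
    {"raw_text": "", "normalized_text": "that concludes the vote"},
]

total, empty_raw, recoverable = count_empty_raw_text(toy_rows)
print(f"{empty_raw} of {total} examples have empty raw_text; "
      f"{recoverable} of those have non-empty normalized_text")
```

Because the helper only iterates over dict-like examples, it should also accept a loaded dataset split, though that would be much slower than a vectorised check on a split of this size.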

from distil-whisper.

guynich commented on May 26, 2024

Copied from #98
When pseudo-labelling the VoxPopuli dataset, the "raw_text" column (needed for option --text_column_name) may be an empty string for some examples - see HF dataset model card here for an empty "raw_text" example.

Question: how do I check which text column ("raw_text" or "normalized_text") was used when creating the pseudo-labelled datasets on HF, such as https://huggingface.co/datasets/distil-whisper/voxpopuli ?


sanchit-gandhi commented on May 26, 2024

Hey @guynich - the provided transcriptions in the original VoxPopuli dataset are only used for computing the WER in the pseudo-labelling and distillation scripts. Since the WER is computed on normalised transcriptions, you can safely use the "normalized_text" column in the dataset, which is what was done for the Distil-Whisper datasets.

If you do decide to use the un-normalised (raw) text column, you should filter out any empty transcriptions from your dataset using a raw_datasets.filter method, e.g. as done here:

# 10.6: Filter training data with labels longer than `max_label_length`
def is_labels_in_length_range(labels):
    return 0 < len(labels) <= max_label_length

filter_by_labels_fn = partial(
    vectorized_datasets.filter, function=is_labels_in_length_range, input_columns=["labels"]
)
if accelerator.is_main_process:
    vectorized_datasets = (
        filter_by_labels_fn(num_proc=num_workers, desc="filtering train dataset")
        if not data_args.streaming
        else filter_by_labels_fn()
    )
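For the empty-transcription case specifically, the same pattern could be applied to the text column before any WER computation. The sketch below is an illustration rather than the script's actual code: `is_nonempty_text` and the toy rows are hypothetical names, and treating whitespace-only strings as empty is an assumption.

```python
from typing import Dict, List


def is_nonempty_text(text: str) -> bool:
    """Keep an example only if its transcription has non-whitespace content."""
    return len(text.strip()) > 0


# Toy rows standing in for dataset examples (illustrative data only).
sample_rows: List[Dict[str, str]] = [
    {"raw_text": "Thank you, Mr President.", "normalized_text": "thank you mr president"},
    {"raw_text": "", "normalized_text": "the debate is closed"},
    {"raw_text": "   ", "normalized_text": "that concludes the vote"},
]

# Plain-Python equivalent of filtering on the "raw_text" column.
kept_rows = [row for row in sample_rows if is_nonempty_text(row["raw_text"])]
print(f"kept {len(kept_rows)} of {len(sample_rows)} rows")
```

With Hugging Face `datasets`, the same predicate could presumably be passed as `raw_datasets.filter(is_nonempty_text, input_columns=["raw_text"])`, mirroring the `input_columns=["labels"]` call above.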


guynich commented on May 26, 2024

Thank you for the helpful comment and for the fix #102. Closing.
