
data_utils's Introduction

Data Utility For Large Scale Speech Data Processing

This repository contains scripts for processing speech data at large scale.

Repository Structure

The repository is organized into task-specific directories and a set of standalone utility scripts, each serving a different step in the project workflow.

Scripts and Folders

  • benchmark/ - Contains scripts and resources to benchmark the performance of the models or algorithms developed in this project.
  • create_hf/ - Scripts related to organizing and setting up models for the Hugging Face hub.
  • data_anno/ - Tools and scripts for data annotation processes.
  • hf_download/ - Utility scripts to download datasets from the Hugging Face hub.
  • in_house_ds/ - In-house data scripts that may include cleaning, preprocessing, or dataset-specific operations.
  • inference_tool/ - Scripts for running inference with the models trained in this project.
  • libri-light/ - Scripts for processing or using the Libri-Light dataset.
  • logging_tool/ - Utilities for logging the output of scripts and processes.
  • post_process/ - Scripts for post-processing the outputs of models or data transformation steps.
  • youtube/ - Scripts to handle downloading and processing data from YouTube.

Utility Scripts

  • data_check.sh - A shell script to validate the integrity and consistency of the data used in the project.
  • data_loader_sample.py - A Python script that serves as an example of how to load data within the project framework.
  • data_process_template.py - A template Python script for standardizing data processing tasks.
  • data_stats.py, data_stats_fast.py, data_stats_fusion.py - Python scripts for generating statistical summaries of the datasets.
  • get_total_hours.py - A Python script for calculating the total hours of data or processing time.
  • mount_synology_nas.sh - A shell script to mount Synology NAS drives, typically used for data storage and access.
  • my_dataloader.py - Custom script for loading data in a specific way required by the project.
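
As an illustration of the kind of computation a script such as get_total_hours.py might perform, here is a minimal sketch that sums the durations of WAV files under a directory. The actual script is not shown in this README, so the function names, the restriction to uncompressed PCM .wav files, and the use of the standard-library wave module are all assumptions made for this example.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (hypothetical helper)."""
    with wave.open(str(path), "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def total_hours(root):
    """Sum the durations of all .wav files found recursively under root, in hours."""
    seconds = sum(wav_duration_seconds(p) for p in Path(root).rglob("*.wav"))
    return seconds / 3600.0
```

Calling total_hours("/path/to/corpus") would return the total audio length in hours. Other formats (FLAC, MP3) would need a different reader, which this sketch does not cover.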

data_utils's People

Contributors

lingy132

Watchers

Lin Geyu

data_utils's Issues

minor normalization

sentence_timestamp = refine_script(word_segs, transcripts, speaker_map)

# Ignore the original transcripts, since they differ slightly from the word segments;
# rebuild the transcript text from the word segments instead.
transcripts_from_word_segs = ''.join([w.word for w in word_segs])
sentence_timestamp = refine_script(word_segs, transcripts_from_word_segs, speaker_map)
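
The normalization above can be sketched in isolation as follows. The WordSeg class is a hypothetical stand-in for whatever segment type the aligner returns (only its word attribute is taken from the snippet), and refine_script is left out because its implementation is not shown here. The sep parameter is also an assumption: the original joins with an empty string, which suits languages written without spaces between words, while space-delimited languages would likely need sep=" ".

```python
from dataclasses import dataclass


@dataclass
class WordSeg:
    # Hypothetical stand-in for the aligner's word segment type.
    word: str
    start: float
    end: float


def transcript_from_word_segs(word_segs, sep=""):
    """Concatenate the word fields in order, placing sep between words."""
    return sep.join(w.word for w in word_segs)
```

With two segments whose words are "hello" and "world", the default call returns "helloworld", and passing sep=" " returns "hello world".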
