lingy12 / data_utils

Python 89.58% Shell 7.80% Roff 0.17% Jupyter Notebook 2.45%

data_utils's Introduction

Data Utility For Large Scale Speech Data Processing

This repository contains scripts for processing speech data at large scale.

Repository Structure

This repository contains scripts and directories, each serving a different purpose in the project workflow.

Scripts and Folders

benchmark/ - Contains scripts and resources to benchmark the performance of the models or algorithms developed in this project.
create_hf/ - Scripts related to organizing and setting up models for the Hugging Face hub.
data_anno/ - Tools and scripts for data annotation processes.
hf_download/ - Utility scripts to download datasets from the Hugging Face hub.
in_house_ds/ - In-house data scripts that may include cleaning, preprocessing, or dataset-specific operations.
inference_tool/ - Scripts for running inference with the models trained in this project.
libri-light/ - Related to processing or using the Libri-Light dataset.
logging_tool/ - Utilities for logging the output of scripts and processes.
post_process/ - Scripts for post-processing the outputs of models or data transformation steps.
youtube/ - Scripts to handle downloading and processing data from YouTube.

Utility Scripts

data_check.sh - A shell script to validate the integrity and consistency of the data used in the project.
data_loader_sample.py - A Python script that serves as an example of how to load data within the project framework.
data_process_template.py - A template Python script for standardizing data processing tasks.
data_stats.py, data_stats_fast.py, data_stats_fusion.py - Python scripts for generating statistical summaries of the datasets.
get_total_hours.py - A Python script for calculating the total hours of data or processing time.
mount_synology_nas.sh - A shell script to mount Synology NAS drives, typically used for data storage and access.
my_dataloader.py - Custom script for loading data in a specific way required by the project.
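
The description of get_total_hours.py suggests summing audio durations across a dataset. The script itself is not shown here, so the following is only a minimal sketch of that idea, assuming the data is stored as PCM WAV files under one root directory. The function names `wav_duration_seconds` and `total_hours` are hypothetical and are not taken from the repository.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        # Duration = number of frames / frames per second.
        return wf.getnframes() / wf.getframerate()


def total_hours(root: Path) -> float:
    """Sum the durations of all .wav files under root, in hours.

    Assumption: every audio file is a PCM WAV readable by the
    standard-library wave module; other formats would need another reader.
    """
    seconds = sum(wav_duration_seconds(p) for p in sorted(root.rglob("*.wav")))
    return seconds / 3600.0
```

Reading only the WAV header values keeps the pass cheap, which matters when a dataset holds many files; compressed formats such as MP3 or FLAC would need a different duration reader.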

data_utils's Issues

minor normalization

data_utils/force_align/aligner.py

Line 73 in 65b67f1

sentence_timestamp = refine_script(word_segs, transcripts, speaker_map)

# Ignore the original transcripts; they differ slightly from the word segments.
transcripts_from_word_segs = ''.join([w.word for w in word_segs])
sentence_timestamp = refine_script(word_segs, transcripts_from_word_segs, speaker_map)
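
The proposed change rebuilds the transcript from the aligned words instead of using the original text. A minimal sketch of that step follows. The `WordSeg` class here is a hypothetical stand-in: only its `word` attribute appears in the issue, and the `start` and `end` fields are assumptions added for illustration. `refine_script` is not reproduced because its implementation is not shown.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WordSeg:
    """Hypothetical word segment; only `word` is confirmed by the issue."""
    word: str
    start: float  # assumed: start time in seconds
    end: float    # assumed: end time in seconds


def transcript_from_word_segs(word_segs: Iterable[WordSeg]) -> str:
    """Concatenate segment texts with no separator, as in the issue."""
    return "".join(w.word for w in word_segs)


segs = [WordSeg("你好", 0.0, 0.5), WordSeg("世界", 0.5, 1.0)]
print(transcript_from_word_segs(segs))
```

Joining with an empty string suits text whose words are not separated by spaces. If the segments hold space-delimited words, a `" ".join(...)` would likely be needed instead; which case applies to this aligner is not stated in the issue.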

typo

data_utils/force_align/aligner.py

Line 75 in 65b67f1

output_path = Path(audio).with_suffix('.algined.jsonl')
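
The issue title points at the misspelled suffix '.algined.jsonl'. Assuming the intended suffix is '.aligned.jsonl', the corrected line could look like the sketch below. The wrapper function `aligned_output_path` is hypothetical and exists only to make the example testable.

```python
from pathlib import Path


def aligned_output_path(audio: str) -> Path:
    """Derive the alignment output path next to the audio file.

    Assumption: '.aligned.jsonl' is the intended spelling of the suffix.
    """
    return Path(audio).with_suffix(".aligned.jsonl")


print(aligned_output_path("data/clip_001.wav"))
```

Note that `with_suffix` replaces only the final suffix, so an input such as `clip.tar.gz` would become `clip.tar.aligned.jsonl`. Files already written with the misspelled suffix would also need renaming if downstream scripts look for the corrected name.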
