USCDiarLibri

USCDiarLibri is a dataset for testing speaker diarization systems under a variety of customized and randomized setups.

The USCDiarLibri dataset consists of artificial multi-party dialogs built from noisy, reverberated audio from the LibriSpeech database, and it is highly parameterized to allow for diverse conditions.

The USCDiarLibri scripts generate the dataset from an external speech corpus and an external noise dataset. The LibriSpeech and QUT-NOISE datasets must therefore be downloaded into the folders shown below before you run the data generation script.

Download and Installation

Data Preparation

(1) Download the following speech dataset: LibriSpeech (the layout below uses the train-clean-100 subset).

(2) Download the following noise dataset: QUT-NOISE.

(3) Set up the directory that contains USCDiarLibri as follows.

 SCUBA-USCDiarLibri
  +--QUT-NOISE         
  |   +--QUT-NOISE-TIMIT/        
  |   +--QUT-NOISE-NIST2008/     
  |   +--QUT-NOISE/
  |   +--docs/
  |   +--code/
  +--LibriSpeech         
  |   +--train-clean-100/ 
  |   +--BOOKS.TXT
  |   +--CHAPTERS.TXT
  |   ...
  +--train-clean-100-json
  |   +--103/
  |   +--1034/
  |   +--1040/
  |   ...
  +--Libre_file_list.txt   
  +--QUT_noise_list.txt
  +--README.md
  +--data_creation_module.py
  +--USCDiarLibri_2_4.py
  +--USCDiarLibri_2_6.py
  +--USCDiarLibri_gen.py
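
Before running the generation scripts, it can help to confirm that the layout above is in place. The following is a minimal sketch and not part of the repository: the function name `missing_entries` and the choice of which paths to check are assumptions based only on the tree shown above.

```python
from pathlib import Path

# Entries taken from the directory tree above; adjust if your layout differs.
EXPECTED_ENTRIES = [
    "QUT-NOISE",
    "LibriSpeech/train-clean-100",
    "train-clean-100-json",
    "Libre_file_list.txt",
    "QUT_noise_list.txt",
    "USCDiarLibri_gen.py",
]


def missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in expected if not (root / entry).exists()]


if __name__ == "__main__":
    # Run from inside the SCUBA-USCDiarLibri directory.
    missing = missing_entries(".")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```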

Prerequisites

Creating the USCDiarLibri Dataset

  • For a preconfigured dataset, run one of the provided Python scripts.
$ python USCDiarLibri_2_4.py  # two primary speakers, 4 speakers in total
$ python USCDiarLibri_2_6.py  # two primary speakers, 6 speakers in total
  • For a customized dataset, modify the parameters in USCDiarLibri_gen.py. The parameters are defined as entries of a Python dictionary, as shown below:
session_dict['parameter_name'] = [Value]
  • The parameters are described in the next section.

Parameters and Descriptions

The following parameters are set in USCDiarLibri_gen.py. Randomization is performed session by session.

librispeech_directory: String. The directory path of the downloaded LibriSpeech data.

noise_data_directory: String. The directory path of the downloaded QUT-NOISE data.

wav_output_directory: String. The directory path for generated .wav files.

verbose: Python Boolean: True or False. If True, messages are displayed during data generation.

num_of_prime_spkrs: Positive integer. The number of primary speakers. Currently, the number of primary speakers is fixed at 2.

num_of_all_spkrs: Positive integer. The total number of speakers per session, including both primary and interfering speakers.

dialogue_prob: Python list. The probabilities of the states [Silence, Overlap, speaker 1, speaker 2, speaker 3, ..., speaker N]. A state with a higher probability appears more frequently than the others.
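
To illustrate how such a probability list can drive state selection, here is a minimal sketch using weighted sampling from the standard library. It is not the generation script's actual code: the function name `sample_states`, the state labels, and the choice to draw each state independently are assumptions made for illustration.

```python
import random


def sample_states(dialogue_prob, num_turns, seed=None):
    """Draw `num_turns` dialogue states, using dialogue_prob as weights.

    State labels follow the order given in the description:
    [Silence, Overlap, speaker 1, ..., speaker N].
    """
    num_speakers = len(dialogue_prob) - 2
    if num_speakers < 1:
        raise ValueError("dialogue_prob needs at least one speaker entry")
    labels = ["silence", "overlap"] + [
        f"speaker{i}" for i in range(1, num_speakers + 1)
    ]
    rng = random.Random(seed)
    # random.choices accepts relative weights, so they need not sum to 1.
    return rng.choices(labels, weights=dialogue_prob, k=num_turns)


if __name__ == "__main__":
    # Hypothetical weights: the two speakers dominate, silence and overlap are rarer.
    print(sample_states([0.1, 0.1, 0.4, 0.4], num_turns=8, seed=0))
```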

number_of_spk_turns: Positive integer or -1. The number of speaker turns in a session. Use -1 to create as many turns as possible. A turn is a change of state in the artificial dialogue. For example, a session with three turns could consist of speech from speaker 1 for 2.3 sec, followed by silence for 1.8 sec, followed by speech from speaker 5 for 3.6 sec.
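
The three-turn example above can be laid out on a timeline as follows. This is only a sketch of the bookkeeping involved; the function name `turns_to_timeline` and the tuple representation are hypothetical and do not come from the repository.

```python
from itertools import accumulate


def turns_to_timeline(turns):
    """Convert (state, duration_sec) turns into (state, start_sec, end_sec) segments."""
    ends = list(accumulate(duration for _, duration in turns))
    starts = [0.0] + ends[:-1]
    return [
        (state, start, end)
        for (state, _), start, end in zip(turns, starts, ends)
    ]


if __name__ == "__main__":
    # The example session from the description above.
    example = [("speaker1", 2.3), ("silence", 1.8), ("speaker5", 3.6)]
    for state, start, end in turns_to_timeline(example):
        print(f"{state:>8}: {start:.1f} s to {end:.1f} s")
```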

dist_prob_range_prime_spk: Python list: [Min, Max]. The range of the uniform random variable for the distance between the two primary speakers.

dist_prob_range_bgr_spk: Python list: [Min, Max]. The range of the uniform random variable for the distance between the microphone and the interfering speakers.

noise: Python Boolean: True or False. Toggle the background noise.

noise_gain_dB_range: Python list: [Min, Max]. The range of the uniform random variable for the signal-to-noise ratio (SNR) in dB.

absorption range: Python list: [Min, Max]. The range of the uniform random variable for the absorption coefficient of the virtual room used to simulate the impulse response. If you set it to 0, you get an anechoic signal.
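
As a worked illustration of drawing an SNR from a [Min, Max] range and scaling the noise to match it, consider the sketch below. It uses the standard definition SNR = 10 log10(P_speech / P_noise); whether the generation script computes the gain in exactly this way is an assumption, and the function names are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Linear gain for `noise` so that the speech-to-noise power ratio is snr_db."""
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))


def mix_at_random_snr(speech, noise, snr_db_range, rng=random):
    """Draw an SNR uniformly from [min, max] and mix the scaled noise into the speech."""
    snr_db = rng.uniform(snr_db_range[0], snr_db_range[1])
    gain = noise_gain_for_snr(speech, noise, snr_db)
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, snr_db


if __name__ == "__main__":
    # Toy signals and a hypothetical range of 10 to 20 dB.
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, -0.5, 0.5, -0.5]
    mixed, snr_db = mix_at_random_snr(speech, noise, [10, 20], random.Random(0))
    print(f"drew SNR = {snr_db:.2f} dB")
```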

number_of_sess: Positive integer, -1 or -2. The number of sessions to create. With -1, the system generates the maximum number of sessions. To create a specific interval of sessions, use -2 and specify the minimum and maximum index numbers with start and end.

start: Positive integer. Minimum session index number, used when number_of_sess is -2.

end: Positive integer. Maximum session index number, used when number_of_sess is -2.
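
The three modes of number_of_sess can be summarized in a small helper. This is a sketch under stated assumptions, not the script's own logic: session indices are assumed to start at 1, the end index is assumed to be inclusive, and `max_sessions` is a hypothetical stand-in for however many sessions the available data allows.

```python
def resolve_session_indices(number_of_sess, max_sessions, start=None, end=None):
    """Return the session indices to generate for a given number_of_sess setting.

    Assumptions (not confirmed by the source): indices are 1-based and
    `end` is inclusive.
    """
    if number_of_sess == -1:
        # Generate as many sessions as possible.
        return range(1, max_sessions + 1)
    if number_of_sess == -2:
        # Generate only the interval [start, end].
        if start is None or end is None or start > end:
            raise ValueError("number_of_sess = -2 needs start <= end")
        return range(start, end + 1)
    if number_of_sess >= 1:
        return range(1, min(number_of_sess, max_sessions) + 1)
    raise ValueError("number_of_sess must be a positive integer, -1 or -2")


if __name__ == "__main__":
    print(list(resolve_session_indices(-2, max_sessions=100, start=5, end=8)))
```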

file_id: String. Determines the tag for the name of the output file.
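
Putting the parameters together, a configuration might look like the sketch below. Only the key names come from the descriptions above; every value is hypothetical, `[Value]` is read here as a placeholder for the value itself, and the absorption parameter is left out because its exact key name is not given above. The `validate_session_dict` helper is not part of the repository, and its check that dialogue_prob has num_of_all_spkrs + 2 entries is an assumption drawn from the state list [Silence, Overlap, speaker 1, ..., speaker N].

```python
# Hypothetical values; only the key names come from the parameter descriptions.
session_dict = {}
session_dict["librispeech_directory"] = "./LibriSpeech"
session_dict["noise_data_directory"] = "./QUT-NOISE"
session_dict["wav_output_directory"] = "./output_wav"
session_dict["verbose"] = True
session_dict["num_of_prime_spkrs"] = 2
session_dict["num_of_all_spkrs"] = 4
session_dict["dialogue_prob"] = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
session_dict["number_of_spk_turns"] = 20
session_dict["dist_prob_range_prime_spk"] = [0.5, 1.5]
session_dict["dist_prob_range_bgr_spk"] = [2.0, 4.0]
session_dict["noise"] = True
session_dict["noise_gain_dB_range"] = [10, 20]
session_dict["number_of_sess"] = -2
session_dict["start"] = 1
session_dict["end"] = 10
session_dict["file_id"] = "demo"

RANGE_KEYS = [
    "dist_prob_range_prime_spk",
    "dist_prob_range_bgr_spk",
    "noise_gain_dB_range",
]


def validate_session_dict(cfg):
    """Return a list of human-readable problems found in the configuration."""
    problems = []
    if cfg["num_of_prime_spkrs"] != 2:
        problems.append("num_of_prime_spkrs is currently fixed at 2")
    if cfg["num_of_all_spkrs"] < cfg["num_of_prime_spkrs"]:
        problems.append("num_of_all_spkrs must include the primary speakers")
    # Assumed: one entry each for silence and overlap, plus one per speaker.
    if len(cfg["dialogue_prob"]) != cfg["num_of_all_spkrs"] + 2:
        problems.append("dialogue_prob should have num_of_all_spkrs + 2 entries")
    for key in RANGE_KEYS:
        value = cfg[key]
        if len(value) != 2 or value[0] > value[1]:
            problems.append(f"{key} must be [Min, Max] with Min <= Max")
    n = cfg["number_of_sess"]
    if n == -2:
        if cfg["start"] > cfg["end"]:
            problems.append("start must not exceed end")
    elif n != -1 and n < 1:
        problems.append("number_of_sess must be a positive integer, -1 or -2")
    return problems


if __name__ == "__main__":
    print(validate_session_dict(session_dict) or "configuration looks consistent")
```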

Generated Dataset

The USCDiarLibri script generates three kinds of files.

  • WAV file - session_[N]_ch[M].wav : Contains the output of one microphone, including the speech signals of the primary speakers, the interfering speakers, and the noise.

  • JSON file - session_[N]_ch[M].json : Contains word alignment information for each channel. The information includes the aligned word (alignedword), start and end times, and the duration of each phoneme.

  • RTTM file - session_[N].rttm : RTTM is the evaluation file format used in the NIST Rich Transcription evaluations. For details, refer to The Rich Transcription 2006 Spring Meeting Recognition Evaluation.
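
As an illustration of how the RTTM output could be consumed, the sketch below parses SPEAKER records and sums the speaking time per speaker. It assumes the generated files follow the common RTTM layout (type, file ID, channel, onset, duration, two placeholder fields, speaker name, ...); the sample lines and speaker names are made up for the example and are not taken from an actual generated file.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "file_id channel onset duration speaker")


def parse_rttm(lines):
    """Parse SPEAKER records from RTTM lines, skipping any other record types."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        segments.append(
            Segment(
                file_id=fields[1],
                channel=fields[2],
                onset=float(fields[3]),
                duration=float(fields[4]),
                speaker=fields[7],
            )
        )
    return segments


def speaking_time(segments):
    """Total duration in seconds per speaker name."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + seg.duration
    return totals


if __name__ == "__main__":
    # Made-up records in the assumed layout.
    sample = [
        "SPEAKER session_0 1 0.00 2.30 <NA> <NA> speaker1 <NA> <NA>",
        "SPEAKER session_0 1 4.10 3.60 <NA> <NA> speaker5 <NA> <NA>",
    ]
    print(speaking_time(parse_rttm(sample)))
```

To read a file from disk, pass the open file object directly, for example `parse_rttm(open("session_0.rttm"))`, where the file name is again only an example.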

Contact Information

Taejin Park, University of Southern California
