Synthetic Diarization Corpus

Introduction

A synthetic corpus of dialogs was constructed from the LibriSpeech corpus, and is made freely available for diarization research. It includes over 90 hours of training data, and over 9 hours each of development and test data. Both 2-person and 3-person dialogs, with and without overlap, are included. Timing information is provided in several formats, and includes not only speaker segmentations, but also phoneme segmentations. As such, it is a useful starting point for general, particularly early-stage, diarization system development.

How to use

The corpus contains 4 top-level directories:
librispeech2: 2-person dialogs
librispeech2o: 2-person dialogs with overlap
librispeech3: 3-person dialogs
librispeech3o: 3-person dialogs with overlap

All sub-directories are "Kaldi table" data directories. Audio files are 16 kHz, 16-bit, little-endian, mono PCM.
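As a rough illustration of the audio encoding described above, the following sketch decodes a buffer of 16-bit little-endian mono PCM samples and converts a sample count to seconds at 16 kHz. It assumes the bytes are headerless raw PCM, which this README does not confirm; if the files carry a container header (for example WAV), a suitable reader should be used instead. The function names are hypothetical and not part of the corpus.

```python
import struct
from typing import List

SAMPLE_RATE = 16000    # Hz, as stated in the corpus description
BYTES_PER_SAMPLE = 2   # 16-bit samples


def decode_pcm16le(raw: bytes) -> List[int]:
    """Decode headerless 16-bit little-endian mono PCM into signed integers.

    Assumption: `raw` holds sample data only, with no file header.
    """
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a multiple of 2")
    count = len(raw) // BYTES_PER_SAMPLE
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return list(struct.unpack(f"<{count}h", raw))


def duration_seconds(num_samples: int) -> float:
    """Convert a sample count to seconds at the corpus sample rate."""
    return num_samples / SAMPLE_RATE
```

At this sample rate, 160 samples span 0.01 seconds of audio.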

Formats

ctm - each line is F C BT DUR word
Where:
F The waveform filename. NOTE: no pathnames or extensions are expected.
C The speaker.
BT The begin time (seconds) of the segment, measured from the start of the file.
DUR The duration (seconds) of the segment.
word The word in the segment.
labs - each line is a speaker id, or 0 for pauses. One line corresponds to 0.01 seconds of audio.
rttm0 - Rich Transcription Time Marked file format. The full specification can be found in Appendix A of NIST's "The 2009 (RT-09) Rich Transcription Meeting Recognition Evaluation Plan".
rttm - merged rttm0, without pauses
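The ctm and labs descriptions above can be read programmatically along the following lines. This is a minimal sketch, not the tooling used to build the corpus: the function names are hypothetical, ctm fields are assumed to be whitespace-separated, and each labs line is assumed to hold exactly one label (how overlapping speech is encoded in labs is not specified here). The run-merging step is similar in spirit to dropping pauses, but it is not claimed to reproduce the rttm files.

```python
from itertools import groupby

FRAME_SECONDS = 0.01  # one labs line per 0.01 s, per the format description


def parse_ctm_line(line):
    """Split one ctm line into (file, speaker, begin_s, duration_s, word).

    Assumption: the five fields are separated by whitespace.
    """
    f, c, bt, dur, word = line.split()
    return f, c, float(bt), float(dur), word


def labs_to_segments(labels):
    """Collapse per-frame labels into (speaker, begin_s, duration_s) runs.

    Consecutive identical labels form one run; runs labeled "0" (pauses)
    advance the clock but are not emitted.
    """
    segments = []
    frame = 0
    for speaker, run in groupby(labels):
        length = sum(1 for _ in run)
        if speaker != "0":
            begin = round(frame * FRAME_SECONDS, 2)
            duration = round(length * FRAME_SECONDS, 2)
            segments.append((speaker, begin, duration))
        frame += length
    return segments
```

For example, with made-up speaker ids, `labs_to_segments(["0", "0", "1089", "1089", "1089", "0", "2300"])` yields one 0.03-second segment for speaker 1089 starting at 0.02 s and one 0.01-second segment for speaker 2300 starting at 0.06 s.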

This corpus is licensed under CC BY 4.0 and requires citing the following reference:

Edwards, E., Brenndoerfer, M., Robinson, A., Sadoughi, N., Finley, G. P., Korenevsky, M., Axtmann, N., Miller, M. & Suendermann-Oeft, D. (2018, September). A Free Synthetic Corpus for Speaker Diarization Research. In International Conference on Speech and Computer (pp. 113-122). Springer, Cham.

Bibtex

@inproceedings{edwards2018free,
  title={A Free Synthetic Corpus for Speaker Diarization Research},
  author={Edwards, Erik and Brenndoerfer, Michael and Robinson, Amanda and Sadoughi, Najmeh and Finley, Greg P and Korenevsky, Maxim and Axtmann, Nico and Miller, Mark and Suendermann-Oeft, David},
  booktitle={International Conference on Speech and Computer},
  pages={113--122},
  year={2018},
  organization={Springer}
}

Based on the LibriSpeech ASR corpus
