This project forked from swshon/dialectid_e2e

End to End Dialect Identification using Convolutional Neural Network

End-to-end Dialect Identification (implementation on MGB-3 Arabic dialect dataset)

Tensorflow implementation of End-to-End dialect identification in Arabic. If you are familiar with language/speaker identification/verification, it can easily be adapted to other dialect, language, or even speaker identification/verification tasks.

Requirements

  • Python, tested on 2.7.6
  • Tensorflow > v1.0
  • python library sox, tested on 1.3.2
  • python library librosa, tested on 0.5.1

Data list format

Each line of the data list consists of the location of a wav file and its label as a digit.

Example) "train.txt"

./data/wav/EGY/EGY000001.wav 0
./data/wav/EGY/EGY000002.wav 0
./data/wav/NOR/NOR000001.wav 4

Labels of Dialect:

  • Egyptian (EGY) : 0
  • Gulf (GLF) : 1
  • Levantine (LAV) : 2
  • Modern Standard Arabic (MSA) : 3
  • North African (NOR): 4
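
As a minimal sketch of how a data list in this format could be read, the snippet below parses "path label" lines and checks each label against the table above. The names `DIALECT_LABELS` and `parse_datalist` are hypothetical and are not taken from the repository.

```python
# Label mapping copied from the list above.
DIALECT_LABELS = {"EGY": 0, "GLF": 1, "LAV": 2, "MSA": 3, "NOR": 4}


def parse_datalist(lines):
    """Turn 'path label' lines into (path, label) tuples (hypothetical helper)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so paths with spaces survive.
        path, label = line.rsplit(None, 1)
        label = int(label)
        if label not in DIALECT_LABELS.values():
            raise ValueError("unknown label %d in line: %s" % (label, line))
        entries.append((path, label))
    return entries


example = [
    "./data/wav/EGY/EGY000001.wav 0",
    "./data/wav/EGY/EGY000002.wav 0",
    "./data/wav/NOR/NOR000001.wav 4",
]
entries = parse_datalist(example)
print(entries)
```

To read an actual list file, the same function can be applied to the lines of the opened file, for example `parse_datalist(open("train.txt"))`.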

Dataset Augmentation

Augmentation was done with two different methods. The first takes random segments of the input utterance; the second perturbs the speech by modifying its speed and volume.
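
The two augmentation ideas can be illustrated with a short NumPy sketch. This is not the repository's implementation (which may rely on sox for perturbation). The function names, the 16 kHz sample rate, the gain of 1.5, the speed factor of 1.1, and the segment length of 200 frames are all assumptions chosen for illustration, and linear interpolation is used here as a simple stand-in for a proper resampler.

```python
import numpy as np


def random_segment(features, seg_len, rng):
    """Crop a random contiguous run of seg_len frames from a (frames, dims) array."""
    num_frames = features.shape[0]
    if num_frames <= seg_len:
        return features  # too short to crop; keep the whole utterance
    start = rng.randint(0, num_frames - seg_len + 1)  # upper bound is exclusive
    return features[start:start + seg_len]


def volume_perturb(wave, gain):
    """Scale the waveform amplitude by a constant gain factor."""
    return wave * gain


def speed_perturb(wave, factor):
    """Resample by linear interpolation; factor > 1 gives faster, shorter audio."""
    new_len = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(new_idx, old_idx, wave)


rng = np.random.RandomState(0)
wave = np.sin(np.linspace(0, 100, 16000))   # 1 s of synthetic audio (assumed 16 kHz)
fast = speed_perturb(wave, 1.1)             # hypothetical speed factor
loud = volume_perturb(wave, 1.5)            # hypothetical gain
feats = rng.randn(300, 40)                  # synthetic 300-frame, 40-dim features
segment = random_segment(feats, 200, rng)   # hypothetical segment length
print(len(wave), len(fast), segment.shape)
```

Each perturbed copy keeps the label of the utterance it came from, so the training list grows without new annotation.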

Model definition

Simple description of the DNN model:

We used four 1-dimensional CNN (1d-CNN) layers followed by two FC layers (1500-600). The CNN filter sizes are 40x5 - 500x7 - 500x1 - 500x1, with strides of 1-2-1-1 and 500-500-500-3000 filters. The CNN and FC layers are connected by a global average pooling layer, which averages the CNN outputs to produce a fixed output size of 3000x1.
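
To make the layer sizes concrete, the sketch below traces the (frames, channels) shape through the four convolutional layers and shows why global average pooling yields a fixed 3000-dimensional vector for any utterance length. It only does shape arithmetic, not the actual TensorFlow model. The 'same' padding rule (output length = ceil(input length / stride)) and the averaging over the time axis are assumptions, since the description does not state them, and `trace_shapes` and `global_average_pool` are hypothetical names.

```python
import math
import numpy as np

# (input channels, kernel width, stride, number of filters), from the description above.
CONV_LAYERS = [
    (40, 5, 1, 500),
    (500, 7, 2, 500),
    (500, 1, 1, 500),
    (500, 1, 1, 3000),
]


def trace_shapes(num_frames, in_dim=40):
    """Return the (frames, channels) shape after each conv layer.

    Assumes 'same' padding, so only the stride changes the frame count.
    """
    shapes = []
    frames, channels = num_frames, in_dim
    for in_ch, _width, stride, n_filters in CONV_LAYERS:
        assert in_ch == channels, "layer input must match previous output"
        frames = int(math.ceil(frames / float(stride)))
        channels = n_filters
        shapes.append((frames, channels))
    return shapes


def global_average_pool(conv_out):
    """Average over the time axis so any utterance length maps to one vector."""
    return conv_out.mean(axis=0)


shapes = trace_shapes(300)
pooled_short = global_average_pool(np.zeros(trace_shapes(200)[-1]))
pooled_long = global_average_pool(np.zeros(trace_shapes(1000)[-1]))
print(shapes)
print(pooled_short.shape, pooled_long.shape)
```

Because the pooled vector has the same size for short and long inputs, the FC layers can be applied to utterances of any duration.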

End-to-end DID accuracy by epoch

End-to-end DID accuracy by epoch using augmented dataset

Performance comparison with and without Random Segmentation(RS)

Performance evaluation

The best performance is 73.39% accuracy (Feb. 28, 2018).

For reference:

Conventional i-vector with SVM : 60.32%
Conventional i-vector with LDA and Cosine Distance : 62.60%
End-to-End model without dataset augmentation(MFCC): 65.55%
End-to-End model without dataset augmentation(FBANK): 64.81%
End-to-End model without dataset augmentation(Spectrogram): 57.57%

End-to-End model with volume perturbation(MFCC) : 67.49%
End-to-End model with speed perturbation(MFCC) : 70.51%

End-to-End model with speed and volume perturbation (MFCC) : 70.91%
End-to-End model with speed and volume perturbation (FBANK) : 71.92%
End-to-End model with speed and volume perturbation (Spectrogram) : 68.83%

End-to-End model with speed and volume perturbation + random segmentation (MFCC) : 71.05%
End-to-End model with speed and volume perturbation + random segmentation (FBANK) : 73.39%
End-to-End model with speed and volume perturbation + random segmentation (Spectrogram) : 70.17%

Offline test

The offline test can be run with the offline_test.ipynb notebook on our pretrained model. Set the FILENAME variable to the wav file whose Arabic dialect you want to identify.

FILENAME = ['/data/test/NOR_00001.wav']

The result is shown as a bar plot of the likelihood of each of the 5 Arabic dialects, like the one below.

Image of offline result plot
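
As a rough illustration of how per-dialect scores can be turned into the likelihoods shown in such a plot, the sketch below applies a softmax to five scores and ranks the dialects. Whether the notebook computes its likelihoods this way is an assumption. The scores are made up, and `softmax` and `rank_dialects` are hypothetical helpers rather than functions from the repository.

```python
import numpy as np

DIALECTS = ["EGY", "GLF", "LAV", "MSA", "NOR"]  # list index equals the label digit


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def rank_dialects(scores):
    """Return (dialect, probability) pairs sorted from most to least likely."""
    probs = softmax(np.asarray(scores, dtype=float))
    order = np.argsort(probs)[::-1]
    return [(DIALECTS[i], float(probs[i])) for i in order]


ranking = rank_dialects([0.2, -1.0, 0.5, 0.1, 2.3])  # made-up model scores
for name, prob in ranking:
    print("%s %.3f" % (name, prob))
```

The first entry of the ranking is the predicted dialect, and the probabilities are the values a bar plot would display.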

Relevant publication

[1] Suwon Shon, Ahmed Ali, James Glass,
Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition,
Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 98-104
https://arxiv.org/abs/1803.04567

Citing

@inproceedings{Shon2018,
  author={Suwon Shon and Ahmed Ali and James Glass},
  title={Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={98--104},
  doi={10.21437/Odyssey.2018-14},
  url={http://dx.doi.org/10.21437/Odyssey.2018-14}
}
