wannaphong / khanomtan-tts-v1.0

Home Page: https://wannaphong.github.io/KhanomTan-TTS-v1.0/

License: Apache License 2.0

Topics: text-to-speech thai-language thai-text-to-speech

KhanomTan TTS v1.0

KhanomTan TTS (ขนมตาล) is an open-source Thai text-to-speech model that supports speakers in multiple languages, including Thai and English.

KhanomTan TTS is a YourTTS model trained on multilingual data with Thai support. We combined the Thai speech corpora Tsync 1* and Tsync 2* with the mbarnig/lb-de-fr-en-pt-12800-TTS-CORPUS dataset and trained the YourTTS model using code from 🐸 Coqui-TTS.

*Note: These are not the complete corpora; only the smaller public portions are accessible.

Demo: https://wannaphong.github.io/KhanomTan-TTS-v1.0/

Model: https://huggingface.co/wannaphong/khanomtan-tts-v1.0

Config

We added Thai characters to the graphemes config to train the model, and we use the Speaker Encoder model from 🐸 Coqui-TTS.
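Adding Thai graphemes amounts to extending the model's character set with the Thai Unicode block. The sketch below is illustrative only: the pad/eos/bos symbols and punctuation set are assumptions, not values copied from the released config.

```python
# Sketch of a graphemes (characters) config extended with Thai.
latin = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
# The Thai Unicode block (U+0E01..U+0E5B) covers consonants, vowels,
# tone marks, and Thai digits.
thai = "".join(chr(cp) for cp in range(0x0E01, 0x0E5C))

characters_config = {
    "pad": "_",            # assumed padding symbol
    "eos": "&",            # assumed end-of-sentence symbol
    "bos": "*",            # assumed beginning-of-sentence symbol
    "characters": latin + thai,
    "punctuations": "!'(),-.:;? ",
}
```

With this extension, Thai text passes through the grapheme pipeline unchanged instead of being dropped as unknown characters.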

Dataset

We use the Tsync 1 and Tsync 2 corpora (not the complete datasets) and merge them with the mbarnig/lb-de-fr-en-pt-12800-TTS-CORPUS dataset.

Training the model

We trained the model with the 🐸 Coqui-TTS multilingual VITS recipe (version 0.7.1, commit d46fbc240ccf21797d42ac26cb27eb0b9f8d31c4), used the speaker encoder model from 🐸 Coqui-TTS, and released the best model for public access.

Language

  • th-th: Thai
  • en: English
  • fr-fr: French
  • pt-br: Portuguese
  • x-de: German
  • x-lb: Luxembourgish

Speaker and Language

  • en
    • female
      • Linda
    • male
      • p259
      • p274
      • p286
  • fr-fr
    • female
      • Bunny
      • Judith
    • male
      • Bernard
  • pt-br
    • male
      • Ed
  • th-th
    • female
      • Tsyncone
      • Tsynctwo
  • x-de
    • female
      • Judith
      • Kerstin
    • male
      • Thorsten
  • x-lb
    • female
      • Caroline
      • Judith
      • Nathalie
      • Sara
    • male
      • Charel
      • Guy
      • Jemp
      • Luc
      • Marco
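For programmatic use, the speaker/language pairing above can be captured as a lookup table. The structure itself is an illustrative assumption; the speaker names and language codes are taken verbatim from the list above.

```python
# Language code -> gender -> speaker names, as listed in the README.
SPEAKERS = {
    "en": {"female": ["Linda"], "male": ["p259", "p274", "p286"]},
    "fr-fr": {"female": ["Bunny", "Judith"], "male": ["Bernard"]},
    "pt-br": {"female": [], "male": ["Ed"]},
    "th-th": {"female": ["Tsyncone", "Tsynctwo"], "male": []},
    "x-de": {"female": ["Judith", "Kerstin"], "male": ["Thorsten"]},
    "x-lb": {"female": ["Caroline", "Judith", "Nathalie", "Sara"],
             "male": ["Charel", "Guy", "Jemp", "Luc", "Marco"]},
}

# Example lookup: the two Thai speakers come from Tsync 1 and Tsync 2.
thai_speakers = SPEAKERS["th-th"]["female"]
```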

Model Cards

Intended Use

  • Intended for applications that need text-to-speech.
  • Intended for open-source projects, startups, independent developers, students, and others, as a basic tool for use in applications.
  • Not suitable for emotional text-to-speech or code-switched text-to-speech.

Factors

  • Quality depends on the audio quality, the number of hours of speech, and other factors of the text-to-speech task.
  • Evaluation uses the test set.

Metrics

  • avg_loader_time
  • avg_loss_disc
  • avg_loss_0
  • avg_loss_gen
  • avg_loss_kl
  • avg_loss_feat
  • avg_loss_mel
  • avg_loss_duration
  • avg_loss_1
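A useful consistency check on these metrics: in VITS training, loss_1 (the total generator objective) is the sum of the generator, KL, feature-matching, mel, and duration losses, while loss_0 mirrors the discriminator loss. The sketch below verifies this with the figures from the Best Model log in this README.

```python
# Component losses from the Best Model step (GLOBAL_STEP 58800).
loss_gen = 2.15378       # adversarial generator loss
loss_kl = 2.26462        # KL divergence term
loss_feat = 5.16728      # discriminator feature-matching loss
loss_mel = 20.70769      # mel-spectrogram reconstruction loss
loss_duration = 0.30896  # duration predictor loss

# The aggregate generator objective reported as loss_1.
loss_1 = loss_gen + loss_kl + loss_feat + loss_mel + loss_duration
# Matches the logged loss_1 of 30.60234 to within rounding.
```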

Training Data

We use the mbarnig/lb-de-fr-en-pt-12800-TTS-CORPUS corpus merged with the Tsync 1 and Tsync 2 corpora (not the complete data, because we can use the public datasets only and the complete corpora have not been released).

Evaluation Data

A few sound files from the Tsync 1 and Tsync 2 corpora and mbarnig/lb-de-fr-en-pt-12800-TTS-CORPUS.

Quantitative Analyses

See more at https://huggingface.co/wannaphong/khanomtan-tts-v1.0/tensorboard.

Best Model

   --> STEP: 430/449 -- GLOBAL_STEP: 58800
     | > loss_disc: 2.54072  (2.58441)
     | > loss_disc_real_0: 0.19995  (0.18017)
     | > loss_disc_real_1: 0.16633  (0.20526)
     | > loss_disc_real_2: 0.23312  (0.22947)
     | > loss_disc_real_3: 0.22889  (0.24121)
     | > loss_disc_real_4: 0.24266  (0.23288)
     | > loss_disc_real_5: 0.27341  (0.23818)
     | > loss_0: 2.54072  (2.58441)
     | > grad_norm_0: 104.64307  (99.20876)
     | > loss_gen: 2.15378  (2.19429)
     | > loss_kl: 2.26462  (2.09749)
     | > loss_feat: 5.16728  (5.06811)
     | > loss_mel: 20.70769  (20.19465)
     | > loss_duration: 0.30896  (0.39851)
     | > loss_1: 30.60234  (29.95306)
     | > grad_norm_1: 1141.71912  (867.80768)
     | > current_lr_0: 0.00020 
     | > current_lr_1: 0.00020 
     | > step_time: 1.42650  (1.40457)
     | > loader_time: 0.01250  (0.01142)


 > EVALUATION


  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.00631 (+0.00023)
     | > avg_loss_disc: 2.58187 (+0.08440)
     | > avg_loss_disc_real_0: 0.07255 (-0.01495)
     | > avg_loss_disc_real_1: 0.17810 (-0.01394)
     | > avg_loss_disc_real_2: 0.21503 (+0.02537)
     | > avg_loss_disc_real_3: 0.21009 (-0.03936)
     | > avg_loss_disc_real_4: 0.21894 (+0.01136)
     | > avg_loss_disc_real_5: 0.21494 (-0.02851)
     | > avg_loss_0: 2.58187 (+0.08440)
     | > avg_loss_gen: 1.78352 (-0.16346)
     | > avg_loss_kl: 2.18065 (-0.13942)
     | > avg_loss_feat: 4.92250 (+0.03692)
     | > avg_loss_mel: 19.75920 (-1.00270)
     | > avg_loss_duration: 0.30307 (+0.00521)
     | > avg_loss_1: 28.94896 (-1.26345)

Last checkpoint

   --> STEP: 404/449 -- GLOBAL_STEP: 439975
     | > loss_disc: 2.31149  (2.37792)
     | > loss_disc_real_0: 0.16091  (0.13133)
     | > loss_disc_real_1: 0.16328  (0.19283)
     | > loss_disc_real_2: 0.20310  (0.21166)
     | > loss_disc_real_3: 0.23234  (0.22953)
     | > loss_disc_real_4: 0.20503  (0.21695)
     | > loss_disc_real_5: 0.23729  (0.22788)
     | > loss_0: 2.31149  (2.37792)
     | > grad_norm_0: 32.66961  (29.36955)
     | > loss_gen: 2.38829  (2.46154)
     | > loss_kl: 2.13480  (2.14247)
     | > loss_feat: 6.61947  (7.41335)
     | > loss_mel: 18.98203  (18.62947)
     | > loss_duration: 0.26565  (0.32838)
     | > loss_1: 30.39024  (30.97523)
     | > grad_norm_1: 380.15573  (318.00244)
     | > current_lr_0: 0.00018 
     | > current_lr_1: 0.00018 
     | > step_time: 1.36950  (1.60544)
     | > loader_time: 0.01070  (0.01238)


   --> STEP: 429/449 -- GLOBAL_STEP: 440000
     | > loss_disc: 2.36906  (2.37788)
     | > loss_disc_real_0: 0.17430  (0.13148)
     | > loss_disc_real_1: 0.21238  (0.19297)
     | > loss_disc_real_2: 0.23353  (0.21159)
     | > loss_disc_real_3: 0.22233  (0.22948)
     | > loss_disc_real_4: 0.22807  (0.21668)
     | > loss_disc_real_5: 0.23396  (0.22782)
     | > loss_0: 2.36906  (2.37788)
     | > grad_norm_0: 17.20346  (29.36594)
     | > loss_gen: 2.48922  (2.46176)
     | > loss_kl: 1.93436  (2.14287)
     | > loss_feat: 7.27229  (7.42032)
     | > loss_mel: 17.29741  (18.62724)
     | > loss_duration: 0.25341  (0.32484)
     | > loss_1: 29.24669  (30.97705)
     | > grad_norm_1: 148.36732  (319.84647)
     | > current_lr_0: 0.00018 
     | > current_lr_1: 0.00018 
     | > step_time: 1.36860  (1.59275)
     | > loader_time: 0.01080  (0.01229)

Ethical Considerations

This model can synthesize speech in multiple languages and clone a speaker's voice across languages, but the YourTTS model cannot synthesize speech as naturally as a real speaker, and it cannot synthesize at 44100 Hz. How it is used depends on your ethics, so we do not recommend using this model in harmful ways. Poor synthesis may stem from the corpus or other factors, because we could only use public datasets and graphemes to train the model. If you need the best synthesis, we suggest collecting your own large, very high-quality dataset and training a new model with a text-to-speech architecture such as Tacotron or VITS.

Read about YourTTS: coqui-ai/TTS#1759.

Caveats and Recommendations

Contributors

mrpeerat, wannaphong
