Dot (.) stops synthesis (larynx, closed)

rhasspy commented on July 17, 2024
Dot (.) stops synthesis

from larynx.

Comments (7)

chainria commented on July 17, 2024

Hi!

I tried several voices in the GUI, they all did it. Didn't try all of them, but I can certainly give it a shot. I am using harvard right now.
Edit: ALL of the voices I could try in the interface exhibit this problem. "Hello. This is two sentences." simply yields "Hello."

This is Home Assistant OS running on a Raspberry PI 4B with 4GB of RAM. I don't know how to start the script on the CLI since it is using Docker containers. If there is a way, I can gladly try.
Edit: Tried using rhasspy. This works a treat. So it almost looks like an issue with OpenTTS?

I encountered it with song titles as well as a simple "Testing. Testing. Testing." and it stops at the first sentence. I also tried pasting multiple sentences from anywhere and it stopped at the first dot.

follower commented on July 17, 2024

Hi, to make this issue easier to debug it might be helpful to supply some additional information:

  • Does this happen with all voices/vocoders? If not, which ones, specifically?
  • Does it happen when using the larynx script from the command line? e.g. larynx -v en "Hello. This is two sentences."

Certainly in my experience Larynx has synthesized multiple sentences without special handling, so there might be something about the setup that's not working properly.

What operating system/version is this occurring on?

(Also, are these song titles? Have you tested with typing sample sentences directly in case there's an issue with possible hidden/special characters in the title?)

follower commented on July 17, 2024

Thanks for trying those other approaches & reporting back.

Based on your descriptions it does seem likely to be an issue around the OpenTTS integration.

I don't have any experience with that aspect of this project so can't give you any specific help for that, sorry.

In terms of debugging approach, I'd look at how the text string gets passed through the different parts of the system to see if part of it is getting dropped along the way. Maybe check whether Home Assistant/OpenTTS logs the input/output text data during processing, to see where (or if) it changes?
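
One way to tell whether audio is being dropped, rather than text, is to compare the duration of the returned WAV for a one-sentence input against a multi-sentence input. The following is a minimal sketch using only Python's standard `wave` module; `make_silent_wav` is a hypothetical stand-in for a real TTS response, and the 22050 Hz sample rate is an assumption.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def make_silent_wav(num_frames: int, sample_rate: int = 22050) -> bytes:
    """Build a 16-bit mono WAV of silence (hypothetical stand-in for TTS output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


if __name__ == "__main__":
    # Replace these with the bytes returned for each test input.
    one_sentence = make_silent_wav(12288)
    four_sentences = make_silent_wav(4 * 12288)
    print(f"one sentence:   {wav_duration_seconds(one_sentence):.2f} s")
    print(f"four sentences: {wav_duration_seconds(four_sentences):.2f} s")
```

If both inputs produce roughly the same duration, the extra sentences are being lost somewhere between the request and the returned audio.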

chainria commented on July 17, 2024

Thanks! I already assumed that I'll need to report this in OpenTTS itself, just thought I had to start somewhere. And since it doesn't seem to be larynx itself, I'll try that. Also I found how to enable debug and it seems it synthesizes the text in three completely different runs.

--debug --larynx-quality high --larynx-noise-scale 0.333 --larynx-length-scale 1.0
DEBUG:opentts:Namespace(cache=None, debug=True, flite_voices_dir=None, host='0.0.0.0', larynx_denoiser_strength=0.001, larynx_length_scale=1.0, larynx_noise_scale=0.333, larynx_quality='high', marytts_like=None, marytts_url=None, mozillatts_url=None, no_espeak=False, no_festival=False, no_flite=False, no_larynx=False, no_nanotts=False, port=5500)
DEBUG:opentts:Loaded TTS systems: espeak, flite, festival, nanotts, marytts, larynx
Running on 0.0.0.0:5500 over http (CTRL + C to quit)
DEBUG:opentts:['espeak-ng', '--voices']
DEBUG:opentts:Festival voices: {'kal_diphone'}
DEBUG:opentts:Loading voices from voices/marytts
DEBUG:opentts:Voice(id='bits1-hsmm', name='bits1-hsmm', gender='female', language='de', locale='de', tag=None)
DEBUG:opentts:Voice(id='dfki-pavoque-neutral-hsmm', name='dfki-pavoque-neutral-hsmm', gender='male', language='de', locale='de', tag=None)
DEBUG:opentts:Voice(id='bits3-hsmm', name='bits3-hsmm', gender='male', language='de', locale='de', tag=None)
DEBUG:opentts:['espeak-ng', '--voices']
DEBUG:opentts:Festival voices: {'kal_diphone'}
INFO:opentts:Synthesizing with larynx:eva_k-glow_tts (23 char(s))...
DEBUG:opentts:Synthesizing line 1 (23 char(s))
DEBUG:gruut.toksen:Number converter regex: ^-?\d+([,.]\d+)*\w+$
DEBUG:gruut.phonemize:Loading lexicon from voices/larynx/gruut/de-de/lexicon.db
DEBUG:glow_tts:Loading model from voices/larynx/de-de/eva_k-glow_tts/generator.onnx
DEBUG:hifi_gan:Loading HiFi-GAN model from voices/larynx/hifi_gan/vctk_small/generator.onnx
DEBUG:opentts:TTS settings: {'noise_scale': 0.333, 'length_scale': 1.0}
DEBUG:opentts:Vocoder settings: {'denoiser_strength': 0.001}
DEBUG:larynx:{'_': 0, '|': 1, '‖': 2, '#': 3, 'a': 4, 'aɪ̯': 5, 'aʊ̯': 6, 'aː': 7, 'b': 8, 'd': 9, 'd͡ʒ': 10, 'eː': 11, 'f': 12, 'g': 13, 'h': 14, 'iː': 15, 'j': 16, 'k': 17, 'l': 18, 'm': 19, 'n': 20, 'oː': 21, 'p': 22, 'p͡f': 23, 's': 24, 't': 25, 't͡s': 26, 't͡ʃ': 27, 'uː': 28, 'v': 29, 'x': 30, 'yː': 31, 'z': 32, 'ãː': 33, 'ç': 34, 'õː': 35, 'øː': 36, 'ŋ': 37, 'œ': 38, 'ɐ': 39, 'ɔ': 40, 'ɔʏ̯': 41, 'ə': 42, 'ɛ': 43, 'ɛː': 44, 'ɛ̃ː': 45, 'ɪ': 46, 'ʁ': 47, 'ʃ': 48, 'ʊ': 49, 'ʏ': 50, 'ʒ': 51, 'ʔ': 52, 'χ': 53}
DEBUG:larynx:Words for 'Test.': ['test', '.']
DEBUG:larynx:Phonemes for 'Test.': ['#', 't', 'ɛ', 's', 't', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Eins.': ['eins', '.']
DEBUG:larynx:Phonemes for 'Eins.': ['#', 'a', 'eː', 'n', 's', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Zwei.': ['zwei', '.']
DEBUG:larynx:Got mels in 0.19924291200004518 second(s) (shape=(1, 80, 48))
DEBUG:larynx:Phonemes for 'Zwei.': ['#', 't͡s', 'v', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Words for 'Drei.': ['drei', '.']
DEBUG:larynx:Phonemes for 'Drei.': ['#', 'd', 'ʁ', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Got mels in 0.29696504096500576 second(s) (shape=(1, 80, 62))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Initializing denoiser
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 1.1020990899996832 second(s) (shape=(12288,))
DEBUG:larynx:Real-time factor: 0.42 (audio=0.56 sec, infer=1.31 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:opentts:Got 24620 WAV byte(s) for line 1
DEBUG:opentts:Synthesized 24620 byte(s) in 9.16156530380249 second(s)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 1.1691214450402185 second(s) (shape=(15872,))
DEBUG:larynx:Real-time factor: 0.49 (audio=0.72 sec, infer=1.47 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 0.27170646691229194 second(s) (shape=(1, 80, 46))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.27694296499248594 second(s) (shape=(1, 80, 48))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.377510052989237 second(s) (shape=(11776,))
DEBUG:larynx:Real-time factor: 0.82 (audio=0.53 sec, infer=0.65 sec)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.28667956008575857 second(s) (shape=(12288,))
DEBUG:larynx:Real-time factor: 0.99 (audio=0.56 sec, infer=0.57 sec)
INFO:opentts:Synthesizing with larynx:rebecca_braunert_plunkett-glow_tts (23 char(s))...
DEBUG:opentts:Synthesizing line 1 (23 char(s))
DEBUG:glow_tts:Loading model from voices/larynx/de-de/rebecca_braunert_plunkett-glow_tts/generator.onnx
DEBUG:opentts:TTS settings: {'noise_scale': 0.333, 'length_scale': 1.0}
DEBUG:opentts:Vocoder settings: {'denoiser_strength': 0.001}
DEBUG:larynx:Words for 'Test.': ['test', '.']
DEBUG:larynx:Phonemes for 'Test.': ['#', 't', 'ɛ', 's', 't', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Eins.': ['eins', '.']
DEBUG:larynx:Phonemes for 'Eins.': ['#', 'a', 'eː', 'n', 's', '#', '‖', '‖']
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Words for 'Zwei.': ['zwei', '.']
DEBUG:larynx:Phonemes for 'Zwei.': ['#', 't͡s', 'v', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Words for 'Drei.': ['drei', '.']
DEBUG:larynx:Phonemes for 'Drei.': ['#', 'd', 'ʁ', 'aɪ̯', '#', '‖', '‖']
DEBUG:larynx:Got mels in 0.1456335949478671 second(s) (shape=(1, 80, 28))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.17054839001502842 second(s) (shape=(1, 80, 30))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.20584573596715927 second(s) (shape=(7168,))
DEBUG:larynx:Real-time factor: 0.92 (audio=0.33 sec, infer=0.35 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:opentts:Got 14380 WAV byte(s) for line 1
DEBUG:opentts:Synthesized 14380 byte(s) in 5.937345743179321 second(s)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.2447036859812215 second(s) (shape=(7680,))
DEBUG:larynx:Real-time factor: 0.83 (audio=0.35 sec, infer=0.42 sec)
DEBUG:larynx:Running text to speech model (GlowTextToSpeech)
DEBUG:larynx:Got mels in 0.15959876799024642 second(s) (shape=(1, 80, 28))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:larynx:Got mels in 0.16664397495333105 second(s) (shape=(1, 80, 26))
DEBUG:larynx:Running vocoder model (HiFiGanVocoder)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.22502166801132262 second(s) (shape=(7168,))
DEBUG:larynx:Real-time factor: 0.84 (audio=0.33 sec, infer=0.39 sec)
DEBUG:hifi_gan:Running denoiser (strength=0.001)
DEBUG:larynx:Got audio in 0.1921218209899962 second(s) (shape=(6656,))
DEBUG:larynx:Real-time factor: 0.84 (audio=0.30 sec, infer=0.36 sec)
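
The byte counts in this log can be cross-checked against the logged sample counts. This is a minimal sketch, assuming the returned WAV is 16-bit mono PCM with a standard 44-byte header (the log does not state the format). Under that assumption, each "Got ... WAV byte(s)" figure equals the size of a single sentence's audio, not all four.

```python
WAV_HEADER_BYTES = 44  # assumption: canonical PCM WAV header
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono samples


def wav_size(num_samples: int) -> int:
    """Expected WAV file size in bytes for the given number of samples."""
    return WAV_HEADER_BYTES + BYTES_PER_SAMPLE * num_samples


def sentences_covered(returned_bytes: int, sentence_samples: list) -> str:
    """Classify whether the returned bytes match all sentences or just one."""
    if returned_bytes == wav_size(sum(sentence_samples)):
        return "all sentences"
    if any(returned_bytes == wav_size(n) for n in sentence_samples):
        return "one sentence"
    return "unknown"


# Values copied from the debug log above ("Got audio ... shape=(N,)").
RUNS = {
    "eva_k-glow_tts": (24620, [12288, 15872, 11776, 12288]),
    "rebecca_braunert_plunkett-glow_tts": (14380, [7168, 7680, 7168, 6656]),
}

if __name__ == "__main__":
    for voice, (returned_bytes, samples) in RUNS.items():
        print(voice, "->", sentences_covered(returned_bytes, samples))
```

Under these assumptions, 24620 = 44 + 2 × 12288 and 14380 = 44 + 2 × 7168, which matches the first "Got audio" shape logged in each run.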

synesthesiam commented on July 17, 2024

Yep, this appears to be a bug in the OpenTTS integration. I messed up and assumed that sentences were split in a different place. I'll get this cleaned up and release a new version.
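
To illustrate this class of bug (this is not the actual OpenTTS or Larynx code), here is a minimal sketch in which a hypothetical engine yields one chunk per sentence while the caller assumes a single chunk for the whole line. All names here are made up for the example, and the "audio" is just encoded text.

```python
import re
from typing import Iterator


def fake_engine(text: str) -> Iterator[bytes]:
    """Hypothetical engine: yields one 'audio' chunk per sentence."""
    for sentence in re.findall(r"[^.!?]+[.!?]", text):
        yield sentence.strip().encode("utf-8")  # stand-in for audio samples


def synthesize_first_only(text: str) -> bytes:
    """Buggy caller: assumes the engine returns one chunk for the whole text."""
    return next(fake_engine(text), b"")


def synthesize_all(text: str) -> bytes:
    """Corrected caller: joins every chunk the engine yields."""
    return b" ".join(fake_engine(text))


if __name__ == "__main__":
    text = "Hello. This is two sentences."
    print(synthesize_first_only(text))
    print(synthesize_all(text))
```

The buggy caller reproduces the symptom reported in this thread: everything after the first full stop is silently dropped.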

chainria commented on July 17, 2024

Thank you very much! I am looking forward to it :)

synesthesiam commented on July 17, 2024

Should be fixed now in OpenTTS 2.1 👍
