andrew-fennell / cognative

Translated vocal synthesis - Clone a voice and output speech in another language

Python 99.81% Batchfile 0.19%
ai ml python speech-synthesis text-to-speech translation tts voice

cognative's Introduction

Hi there, I'm Andrew! 👋


I recently graduated from Texas A&M University with a BS in Computer Engineering. Outside of work, I enjoy working on personal and open source projects.

About me

  • 🚧  I'm always working on projects for fun
  • 💻  I enjoy contributing to open source projects
  • 📷  I am learning photography and chess
  • 🗻  僕は日本語を勉強しています (I'm studying Japanese)

How I work

  • I am highly motivated and enjoy learning new things.
  • I study and learn well through habit building.
  • I enjoy checking things off the list. ✅
  • I like improving skills, languages, and technologies I've learned through putting them into action with projects.

GitHub stats

These are a few stats from the current year that show my contributions to (public) personal and open source projects.

andrew-fennell's GitHub stats

Get in touch

If you have any questions about what I'm working on or just want to get in touch, feel free to contact me!

cognative's People

Contributors

andrew-fennell, aref-sadeghi-gh, austin-currington, ctis98, dependabot[bot], jacobsmith1020, ptjohn0122


cognative's Issues

Clean up unused RTVC files

There are some different GUI files and things that won't be used. These just need to be cleaned up before a "release".

Develop initial user interface

Create initial user interface.

File browser:

  • Source audio file path
  • Destination audio file path

Dropdown:

  • Source language (checkbox to auto detect language)
  • Destination language

Button:

  • Run vocal synthesis
  • Play audio

Some decisions need to be made about technologies. Those decisions can be added here as comments.

Fix RTVC printing colors

Some colored print statements are never reset.

For example, if something prints in green, the next line may also print in green.
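The usual fix is to terminate every colored print with the ANSI reset code. A minimal sketch (the helper name is illustrative, not from the repo):

```python
# ANSI escape codes: "\033[32m" switches to green, "\033[0m" resets all styling.
GREEN = "\033[32m"
RESET = "\033[0m"

def print_green(message: str) -> None:
    """Print a message in green and always reset the terminal color afterwards."""
    print(f"{GREEN}{message}{RESET}")
```

Wrapping every colored print in a helper like this makes it impossible to forget the reset.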

Create hardware documentation

Hardware documentation should be created (preferably in .md format).

This documentation can replace the current README.md file at CogNative/CogNative/hardware.

Documentation should cover all of the hardware that we are using. It should also include anything that someone would need to run this project (e.g., CUDA).

Feel free to set minimum specs in this document for other components, but I don't think it is required.

Add cross-language model to vocal synthesis

Add at least two cross-language models to vocal synthesis. This is a core function of the project, so it should be one of the primary focuses going forward.

The specific source languages can be chosen by the ML/AI team. The destination language should be English.

Make user interface improvements (if needed)

After an initial user interface is created, there may be new features by April 11th that need to be added.

If there are any adjustments to be made, this is a reminder to do so.

Integrate speech_recognition module with vocal synthesis models.

Create a module that will handle all vocal synthesis interaction.

This should include:

  • Taking in audio file path
  • Choosing language (or auto identifying language, when that is added)
  • Transcribing the audio file (in any supported language)
  • Translating the audio file (to any supported language)
  • Synthesizing audio in that language that mimics the source voice
  • Handling the output location of those files

The output audio may be bad before an improved model is implemented for cross-language vocal synthesis to the destination language. This is okay. The point is to add a module that will handle all of these top-level things.

We will narrow the scope to specific source and destination languages that are supported by our vocal synthesis models when we begin creating those models.
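The steps above can be sketched as a single orchestration function. All names here are illustrative, and the three stage callables stand in for the existing STT, translation, and synthesis modules:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SynthesisRequest:
    """Top-level inputs the module would accept (field names are hypothetical)."""
    audio_path: str
    source_lang: Optional[str]  # None -> auto-detect, once that feature exists
    target_lang: str
    output_path: str

def run_pipeline(request, transcribe, translate, synthesize):
    """Chain the three stages: transcribe -> translate -> synthesize.

    Each callable is injected so the real modules can be swapped in; the
    return value is whatever the synthesis stage reports (e.g., the output path).
    """
    text = transcribe(request.audio_path, request.source_lang)
    translated = translate(text, request.target_lang)
    return synthesize(translated, request.audio_path, request.output_path)
```

Injecting the stages as parameters keeps the module testable with stubs before the cross-language models exist.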

Adjust README to reflect project proposal

The README needs to be adjusted: the joke content should be removed, the project name should be added to match the proposal, and the description should be updated to reflect what we decided to build.

Auto-detect source audio language

The first 10 seconds could be clipped (if the source audio is longer than 10 seconds) to auto-detect the language of the source audio.

This would allow us to remove another input from the UI and CLI.

Integrate STT and Translation

Add functions that transcribe and translate audio in different languages, using the modules already developed for STT and translation.

Noise reduction in vocal synthesis

Currently, the synthesized voice contains significant static.

There are libraries that we can use to improve the sound quality and remove static. This would make the listener's experience significantly better.
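As a toy illustration of the idea only (a real fix would use a dedicated noise-reduction library on the synthesized waveform), a simple moving average acts as a crude low-pass filter and smooths out high-frequency static:

```python
def moving_average(samples, window: int = 5):
    """Smooth a sequence of audio samples with a simple moving average.

    This is a toy sketch of static reduction, not the project's method:
    averaging neighboring samples attenuates high-frequency noise.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed
```

A purpose-built library would do this properly with spectral gating rather than naive averaging.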

Add language auto-detection

Add a language auto-detection feature. This should auto-detect the language being spoken in an input audio file (preferably a .wav file).

Setup new hardware

The NUC may no longer work. We need to set up new hardware, with the help of our professor, and secure a place to store it.

Return file cut short when testing autotranslation

When testing autotranslation, I ran the following command on a one-minute Spanish clip cut from the beginning of Angelina.wav and received a five-second, one-sentence output instead of the full-length audio. The file is in the zip below as well.

(I used the one-minute file to save time/money, but also because the original 25-minute file returns an unrelated error about being too long.)

python -m CogNative.main -sampleAudio "I:\github\repo\CogNative\CogNative\examples\AngelinaShort.wav" -synType audio -dialogueAudio "I:\github\repo\CogNative\CogNative\examples\AngelinaShort.wav" -out "I:\github\repo\CogNative\CogNative\examples\AngelinaShortClone.wav" -useExistingEmbed y

AngelinaShort.zip

Test Hardware

Hardware needs to be tested.

  • Test the GPU for functionality
  • Can we access the machine?
  • Is it updated?

Add functionality to UI

The UI is currently taking user input, but it isn't actually producing a cloned output.

I think the easiest (maybe not the best) route is to call main.py when the "Clone voice" button is clicked.

I think you can do this with the subprocess module (sys alone cannot launch a process). Read stdout to display success or errors on the UI.
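A hedged sketch of that route, launching the CLI entry point in a child process and capturing its output (the helper names and the exact CLI flags are assumptions to be matched against the real entry point):

```python
import subprocess
import sys

def run_command(cmd: list) -> str:
    """Run a command, returning stdout on success or stderr on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else result.stderr

def clone_voice(cli_args: list) -> str:
    """Invoke the project's CLI from the UI button handler (illustrative)."""
    return run_command([sys.executable, "-m", "CogNative.main", *cli_args])
```

The UI's "Clone voice" handler would call `clone_voice([...])` with the same flags used on the command line and display the returned text.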

No ending punctuation issue

If there is no ending punctuation, no temp output chunk is produced, which leads to an error (screenshot omitted).

I don't think this is the case for all inputs, so here is the input to demo this issue:

Input text that fails
Avsikten hos gemenskapen är att utgå från ett ‘globalt perspektiv’, där förhållanden i Sverige inte ska ses som normerande eller viktigare än de i omvärlden

Input text that works
Avsikten hos gemenskapen är att utgå från ett ‘globalt perspektiv’, där förhållanden i Sverige inte ska ses som normerande eller viktigare än de i omvärlden.
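Since the failing and working inputs differ only in the final period, one defensive fix is to normalize input text before chunking. A minimal sketch (the helper name is illustrative, and the assumption that the chunker splits on sentence-ending punctuation is inferred from this bug, not confirmed in the code):

```python
TERMINAL_PUNCTUATION = (".", "!", "?")

def ensure_terminal_punctuation(text: str) -> str:
    """Append a period when input text lacks ending punctuation.

    Guards against the missing-chunk error: if the chunker splits on
    sentence-ending punctuation, unterminated text yields no final chunk.
    """
    stripped = text.rstrip()
    if stripped and not stripped.endswith(TERMINAL_PUNCTUATION):
        return stripped + "."
    return stripped
```

Applying this at the text-input boundary would make both example inputs behave identically.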

Audio file vocal synthesis INPUT cannot exceed a certain length

Problem

Audio files longer than about a minute do not work as vocal synthesis text input (i.e., when an audio file supplies the words to "copy", rather than text being provided directly).

An error is raised (screenshot omitted).

Proposed solution

  • Cut the provided audio into segments
  • Transcribe each audio segment
  • Combine transcriptions

This could run into issues with words and sentences being cut, which would decrease the quality of the transcriptions.
