speech-editing-toolkit's People

Contributors

zain-jiang

speech-editing-toolkit's Issues

Does this support editing multiple segments at the same time?

For example:
Sentence a: 我叫小明,我买了从北京到上海的票。 ("My name is Xiaoming, and I bought a ticket from Beijing to Shanghai.")
Sentence b: 我叫小红,我买了从天津到广东的票。 ("My name is Xiaohong, and I bought a ticket from Tianjin to Guangdong.")

I would like to simultaneously replace 小明 with 小红, 北京 with 天津, and 上海 with 广东. Is this possible?
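In mask-based editing systems of this kind, editing several words in one pass amounts to building a frame-level time mask with multiple disjoint masked spans. This is an illustrative sketch only (not the toolkit's actual API); the function name and span format are made up for the example:

```python
# Toy sketch: several simultaneous edits can be expressed as multiple
# masked spans in a single frame-level mask. Frames marked 1 would be
# regenerated by the model; frames marked 0 stay untouched.

def build_mask(num_frames, edit_spans):
    """edit_spans: list of (start, end) frame ranges to regenerate."""
    mask = [0] * num_frames
    for start, end in edit_spans:
        for i in range(start, end):
            mask[i] = 1
    return mask

# e.g. replace two separate words occupying frames 2-4 and 7-9
print(build_mask(10, [(2, 4), (7, 9)]))  # [0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
```

Whether the released inference scripts accept more than one edited region per utterance is a separate question for the authors; the sketch only shows that nothing about the masking formulation itself rules it out.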

During inference, is it generated as a whole or partially?

Hi,
I recently found in my experiments that audio generated with FluentSpeech also differs from the original audio in the non-modified parts. Below are the waveforms of the unmodified portion of both, with the original audio on top and the generated audio below.
[screenshot: waveform comparison of original vs. generated audio]

However, the FluentSpeech paper states that only the masked part is generated by reverse diffusion, and the non-masked part is left unchanged. Here is the figure from the paper.
[screenshot: figure from the paper]

So is the output generated as a whole, or only the edited part?
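For context, the frame-wise blend described in the paper can be sketched as follows. This is illustrative pseudocode, not the toolkit's actual implementation; the function name is made up, and the "frames" are scalars standing in for mel-spectrogram frames:

```python
# Illustrative sketch: how a time-domain mask combines the reverse-diffusion
# output with the original mel-spectrogram. Frames where mask == 1 come from
# the model; frames where mask == 0 are copied from the original recording.

def combine_with_mask(mel_original, mel_generated, time_mel_mask):
    """Frame-wise blend: masked frames from the model, the rest untouched."""
    return [
        gen if m == 1 else orig
        for orig, gen, m in zip(mel_original, mel_generated, time_mel_mask)
    ]

orig = [0.1, 0.2, 0.3, 0.4]
gen  = [0.9, 0.8, 0.7, 0.6]
mask = [0,   1,   1,   0]    # edit only the middle two frames
print(combine_with_mask(orig, gen, mask))  # [0.1, 0.8, 0.7, 0.4]
```

One plausible explanation for the observed waveform differences, even if the mel frames outside the mask are copied exactly: the vocoder resynthesizes the entire waveform from the blended mel-spectrogram, so sample-level differences in the unedited regions are expected.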

Is there a Chinese example

Dear Jiang,

#9 (comment)

You mentioned that the current package also supports Chinese; could you share an example? Thanks for your excellent work. Today, after spending the whole day on it, I finally finished installing MFA and successfully built a model by running "run_mfa_train_align.sh". I plan to build a Chinese model soon; is there a suitable dataset you would suggest, such as Aishell2?

Kelvin

an error when trying to infer with spec_denoiser.py

Thanks for your excellent work, but I encountered an error when trying to infer with python inference/tts/spec_denoiser.py

Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 272, in <module>
    StutterSpeechInfer.example_run()
  File "inference/tts/spec_denoiser.py", line 259, in example_run
    wav_out, wav_gt, mel_out, mel_gt, masked_mel_out, masked_mel_gt = infer_ins.infer_once(inp)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/inference/tts/base_tts_infer.py", line 97, in infer_once
    output = self.forward_model(inp)
  File "inference/tts/spec_denoiser.py", line 119, in forward_model
    output = self.model(edited_txt_tokens, time_mel_masks=time_mel_masks, mel2ph=edited_mel2ph, spk_embed=sample['spk_embed'],
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/spec_denoiser.py", line 159, in forward
    ret = self.fs(txt_tokens, time_mel_masks, mel2ph, spk_embed, f0, uv, energy,
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 92, in forward
    mel2ph = self.forward_dur(dur_inp, time_mel_masks, mel2ph, txt_tokens, ret, use_pred_mel2ph=use_pred_mel2ph)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 137, in forward_dur
    masked_dur_gt = mel2token_to_dur(mel2ph * (1-time_mel_masks).squeeze(-1).long(), T) * nonpadding
  File "/data3/liukaiyang/Speech-Editing-Toolkit/utils/audio/align.py", line 85, in mel2token_to_dur
    dur = mel2token.new_zeros(B, T_txt + 1).scatter_add(1, mel2token, torch.ones_like(mel2token))
RuntimeError: index 86 is out of bounds for dimension 1 with size 86
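For anyone hitting the same error: `scatter_add` writes into a tensor of width `T_txt + 1`, so every entry of `mel2token` must satisfy `0 <= idx <= T_txt`. The message "index 86 is out of bounds for dimension 1 with size 86" means some alignment index points one token past the end of the text sequence, i.e. `mel2ph` and `edited_txt_tokens` are out of sync. A pure-Python stand-in (the function name is made up; the real code uses torch) demonstrates the invariant:

```python
# Hypothetical repro of the failing invariant in mel2token_to_dur:
# the duration histogram has T_txt + 1 slots, so any mel2token index
# greater than T_txt triggers the out-of-bounds error seen above.

def mel2token_to_dur_py(mel2token, t_txt):
    """Pure-Python stand-in for the torch scatter_add duration count."""
    dur = [0] * (t_txt + 1)
    for idx in mel2token:
        if not 0 <= idx <= t_txt:
            raise IndexError(f"index {idx} is out of bounds for size {t_txt + 1}")
        dur[idx] += 1  # one mel frame attributed to token idx
    return dur

print(mel2token_to_dur_py([1, 1, 2, 3], t_txt=3))  # [0, 2, 1, 1]
```

A likely place to look is how `edited_mel2ph` is built during inference: if the MFA alignment refers to more phonemes than the edited token sequence contains, the maximum `mel2ph` value will exceed the token count.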

VCTK checkpoint

Thanks for the great work on this repository, really useful!

Wondering if there is a VCTK checkpoint that could be accessed, for use with speakers with UK accent?

Again thanks for this repository!

Automatic Stutter Removal

Are a text transcript, defined region, and defined edited_region all required for inference and training on automatic stutter removal? Is there any way to provide only the raw audio and destutter it? If so, would this be done by running spec_denoiser.py or another script?

Where to find mfa_dict.txt and mfa_model.zip?

Hi! I'm getting the following error when running python inference/tts/spec_denoiser.py --exp_name spec_denoiser. Where can I find the required files? I'm trying to run the basic pre-trained inference of FluentSpeech.

Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 350, in <module>
    dataset_info = data_preprocess(test_file_path, test_wav_directory, dictionary_path, acoustic_model_path,
  File "inference/tts/spec_denoiser.py", line 297, in data_preprocess
    assert os.path.exists(file_path) and os.path.exists(input_directory) and os.path.exists(acoustic_model_path), \
AssertionError: inference/example.csv,inference/audio,data/processed/libritts/mfa_dict.txt,data/processed/libritts/mfa_model.zip
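The assertion in `data_preprocess` only checks that these four paths exist before MFA is invoked, so placing the downloaded MFA dictionary and acoustic model at the paths named in the error message should satisfy it. A minimal sketch (path list copied from the assertion message; this is not part of the toolkit) to see which paths are still missing:

```python
# Check which of the paths required by data_preprocess are absent.
import os

required = [
    "inference/example.csv",
    "inference/audio",
    "data/processed/libritts/mfa_dict.txt",
    "data/processed/libritts/mfa_model.zip",
]
missing = [p for p in required if not os.path.exists(p)]
print("missing:", missing)
```

Where to obtain the dictionary and model themselves (official MFA downloads vs. files released by the authors) is the actual question here, and only the maintainers can answer it.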

inference_acl missing

from inference_acl.tts.infer_utils import get_align_from_mfa_output, extract_f0_uv

ModuleNotFoundError: No module named 'inference_acl'

A typo in README

In the Data Preprocess part of the README, the second instruction has a spelling mistake: it should be run_mfa_train_align.sh instead of run_mfa_train_aligh.sh.

Any checkpoints

Hi, thanks for this amazing work. I am just wondering whether you could provide any checkpoints to run speech editing with.

speech edit on Arabic audio

Hello, thanks for the great repo!

If I want to edit Arabic speech, what do you suggest as best practice?
Should I train/finetune FluentSpeech on Arabic audio and keep the vocoder as is?
Also, regarding the code, do I need to change any files for it to be able to edit Arabic speech?

How to process and inference?

Hi,
I used the pre-trained model for inference and found that mfa_model.zip and mfa_dict.txt were missing. I downloaded the relevant models from the official MFA site and created the folders myself to put them in.

However, the output shows:
[screenshot: error output]

Do I need to perform the data-processing part first?
After entering the following command:
[screenshot: command]
it shows:
[screenshot: error output]

How should I solve this problem? I need help!
