
audioldm2's Introduction

AudioLDM 2

arXiv githubio Hugging Face Spaces

This repo currently supports Text-to-Audio (including Music) and Text-to-Speech generation.


Change Log

  • 2023-08-27: Add two new checkpoints!
    • 🌟 48kHz AudioLDM model: Now we support high-fidelity audio generation! Hugging Face Spaces
    • 16kHz improved AudioLDM model: Trained with more data and optimized model architecture.

TODO

  • Add the text-to-speech checkpoint
  • Open-source the AudioLDM training code.
  • Support the generation of longer audio (> 10s)
  • Optimize the inference speed of the model.
  • Integrate with the Diffusers library (see 🧨 Diffusers)
  • Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PR welcomed, same logic as AudioLDMv1)

Web APP

  1. Prepare running environment
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
  2. Start the web application (powered by Gradio)
python3 app.py
  3. A link will be printed out. Click the link to open the browser and play.

Commandline Usage

Installation

Prepare running environment

# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git

If you plan to play around with text-to-speech generation, please also make sure you have installed espeak. On Linux you can install it with:

sudo apt-get install espeak

Run the model in commandline

  • Generate a sound effect or music based on a text prompt
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
  • Generate sound effects or music based on a list of text prompts (see the example file format below)
audioldm2 -tl batch.lst
  • Generate speech based on (1) the transcription and (2) the description of the speaker
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"

Text-to-Speech uses the audioldm2-speech-gigaspeech checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set --model_name audioldm2-speech-ljspeech.
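
For the -tl option, batch.lst is assumed to be a plain text file with one prompt per line, for example (hypothetical prompts):

A cat meowing repeatedly in a quiet room
Thunder rolling in the distance followed by heavy rain
Upbeat jazz piano with a walking bass line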

Random Seed Matters

Sometimes the model may not perform well (the output sounds weird or is low quality) when running on different hardware. In this case, please adjust the random seed and find the optimal one for your hardware.

audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
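
If you want to compare several seeds side by side, a minimal sketch (the seed values and output paths below are arbitrary) could look like:

# Try a few candidate seeds and save each result to its own folder for comparison
for seed in 0 42 1234 7777; do
  audioldm2 --seed $seed -s "./output/seed_$seed" -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
done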

Pretrained Models

You can choose a model checkpoint by setting model_name:

# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

We have the following checkpoints you can choose from:

  1. audioldm2-full (default): Generates both sound effects and music with the AudioLDM2 architecture.
  2. audioldm_48k: This checkpoint can generate high-fidelity sound effects and music.
  3. audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
  4. audioldm2-full-large-1150k: Larger version of audioldm2-full.
  5. audioldm2-music-665k: Music generation.
  6. audioldm2-speech-gigaspeech (default for TTS): Text-to-Speech, trained on the GigaSpeech dataset.
  7. audioldm2-speech-ljspeech: Text-to-Speech, trained on the LJSpeech dataset.
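
For example, to generate with the high-fidelity audioldm_48k checkpoint:

audioldm2 --model_name "audioldm_48k" -t "Musical constellations twinkling in the night sky, forming a cosmic melody."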

We currently support 3 devices:

  • cpu
  • cuda
  • mps (note that the computation requires about 20 GB of RAM)

Other options

  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
                 [--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
                 [--seed SEED]

  optional arguments:
    -h, --help            show this help message and exit
    -t TEXT, --text TEXT  Text prompt to the model for audio generation
    --transcription TRANSCRIPTION
                        Transcription used for speech synthesis
    -tl TEXT_LIST, --text_list TEXT_LIST
                          A file that contains text prompts to the model for audio generation
    -s SAVE_PATH, --save_path SAVE_PATH
                          The path to save model output
    --model_name {audioldm_48k,audioldm_16k_crossattn_t5,audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
                          The checkpoint you are going to use
    -d DEVICE, --device DEVICE
                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
    -b BATCHSIZE, --batchsize BATCHSIZE
                          Generate how many samples at the same time
    --ddim_steps DDIM_STEPS
                          The sampling step for DDIM
    -dur DURATION, --duration DURATION
                          The duration of the samples
    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                          Guidance scale (Large => better quality and relevance to text; Small => better diversity)
    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                          Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
                          heavier computation
    --seed SEED           Changing this value (any integer number) will lead to a different generation result.
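
As an illustrative sketch (the values below are arbitrary examples, not recommended settings), several of these options can be combined in a single command:

# Generate three candidates of a 10-second clip with the 48 kHz checkpoint on CUDA
audioldm2 -t "A dog barking in the distance while rain falls on a tin roof" \
          -s ./output \
          --model_name audioldm_48k \
          -d cuda \
          -dur 10 \
          --ddim_steps 200 \
          -gs 3.5 \
          -n 3 \
          --seed 42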

Hugging Face 🧨 Diffusers

AudioLDM 2 is available in the Hugging Face 🧨 Diffusers library from v0.21.0 onwards. The official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.

The Diffusers version of the code runs upwards of 3x faster than the native AudioLDM 2 implementation, and supports generating audios of arbitrary length.

To install 🧨 Diffusers and 🤗 Transformers, run:

pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate

You can then load pre-trained weights into the AudioLDM2 pipeline, and generate text-conditional audio outputs by providing a text prompt:

from diffusers import AudioLDM2Pipeline
import torch
import scipy

# Load the pre-trained pipeline in half precision and move it to the GPU
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate a 10-second clip from the text prompt
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# Save the generated waveform as a 16 kHz WAV file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

Tips for obtaining high-quality generations can be found under the AudioLDM 2 docs, including the use of prompt engineering and negative prompting.
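
As a rough sketch of negative prompting with the same pipeline (the prompt, negative prompt, and parameter values below are illustrative only, not recommendations from the docs):

import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Describe the desired audio in the prompt and undesired qualities in the negative prompt
prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# Fix the random generator so runs with and without the negative prompt are comparable
generator = torch.Generator("cuda").manual_seed(0)

audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,  # generate several candidates; the top-ranked one is returned first
    generator=generator,
).audios[0]

scipy.io.wavfile.write("hammer.wav", rate=16000, data=audio)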

Tips for optimising inference speed can be found in the blog post AudioLDM 2, but faster ⚡️.

Cite this work

If you find this tool useful, please consider citing:

@article{audioldm2-2024taslp,
  author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining}, 
  year={2024},
  volume={32},
  pages={2871-2883},
  doi={10.1109/TASLP.2024.3399607}
}
@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023},
  pages={21450-21474}
}

audioldm2's People

Contributors

carlthome, eltociear, haoheliu, kamilake, lingchul, microboym, sanchit-gandhi, shyamsantoki, steve235lab, wing0529


audioldm2's Issues

Training code?

Great work! Are you planning on releasing the model training / fine-tuning code? Or is this not the case, as with AudioLDM1?

Does the quality of the prompt affect the generated results?

Does it happen that I wanted to generate sound effects but get music instead, because the quality of the prompt affects the generated result a lot? Or is there a specific model I should choose for each task? It's in your readme, but it doesn't seem to be an option in the Hugging Face API!

Reproducing results on AudioCaps

Thanks for the great work!

What are the suggested inference hyperparameters / checkpoint to reproduce the results on AudioCaps? I was trying to generate audio via audioldm2 --model_name MODEL -t CAPTION on the AudioCaps test set but was unable to get the same FAD/KL (1.42/0.98) as in Table 1. I tried audioldm2-full and audioldm2-full-large-1150k with the default inference hyperparameters, but their FAD/KL are ~2.7/1.3.

Cuda out of memory?

Could you please tell me how much GPU memory is required for model inference?

Web app just keeps processing forever and not doing anything at all.

Everything works fine after typing "py app.py"

I even got the downloads of the test audios that it creates the first time you run it, and they sound fine.

The issue arises whenever I actually try to create anything of my own. I get the link for the web app. I type it in, it pulls up fine. I enter the description for what I want to create, change my settings, it starts processing, and then.........................

It never finishes, ever. It doesn't even put a load on my GPU. My GPU was maxing out when launching "py app.py" initially, but the web app doesn't affect it. It's almost as if there's no communication between the web app and my GPU at all.

I left it going for 10+ minutes hoping something would happen, but it never does.

Any help would be greatly appreciated.

AudioLDM2 via Pinokio -- CUDA out of memory...

I was excited to install AudioLDM2 via Pinokio, but unfortunately, I'm encountering issues with "CUDA out of memory..." after starting.
PyTorch reserves the entire amount of memory available on the GPU, regardless of its size.

I'm unable to proceed further. Do you have any tips or suggestions to help me resolve this issue?

Thank you in advance!

How can I train this model by myself ?

I want to specialize the model for generating what I want. Please teach me how to train this model.

Additionally, please also explain how to generate audio from images.

Model dropdown does not work

Start app.py
Click "Click to modify detailed configruations"
Last option says "Dropdown" but only 1 model is shown and it does not drop down to select the other models.

Optimisation 3: Torch Compile in blog fails

Optimisation 3: Torch Compile in the blog (), which is mentioned in README.md, produced error messages when I added the following three statements in the Colab notebook:

https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb

pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

torch._dynamo.config.suppress_errors = True

audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

The error messages are:

[2023-10-12 08:32:00,552] torch._dynamo.convert_frame: [ERROR] WON'T CONVERT forward /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py line 664
due to:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 403, in dyn_shape
raise DynamicOutputShapeException(func)
torch._subclasses.fake_tensor.DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1206, in run_node
raise RuntimeError(
RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py", line 71, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information


DynamicOutputShapeException Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in run_node(output_graph, node, args, kwargs, nnmodule)
1198 assert nnmodule is not None
-> 1199 return nnmodule(*args, **kwargs)
1200 elif op == "get_attr":

55 frames

DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)

RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Unsupported Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py in unimplemented(msg)
69 def unimplemented(msg: str):
70 assert msg != os.environ.get("BREAK", False)
---> 71 raise Unsupported(msg)
72
73

Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information

Function to go from raw audio to LOA

Is there any function in the code to go from raw audio to LOA? From the paper, I understand this is done by computing the mel-spec, passing it through the pre-trained MAE, and then doing some pooling. I'm trying to reverse-engineer this from the code but it's not trivial. Any help would be greatly appreciated :)

On Windows 11, running audioldm2 gives error "No module named audioldm2.__main__"

From Windows 11 command line, after creating conda env and pip install of AudioLDM2.git, I try to run:

audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

and get the error:

C:\Users\myusername\AppData\Local\anaconda3\envs\audioldm\python.exe: No module named audioldm2.__main__; 'audioldm2' is a package and cannot be directly executed

Runtime error: espeak not installed on your system

I am on an M1 MacBook Air on Monterey and followed the latest install instructions.
When I run the audioldm2 example command "Musical constellations ...etc", I get this runtime error.

When I try "pip3 install espeak", no matching distribution is found; if I run "pip3 install phonemizer", it says it is already installed....
Any ideas?

Installation and running error on M2 Mac

I am trying to install the repo and get it running on my M2 Mac but get the following error.

OSError: dlopen(/opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at8internal15invoke_parallelExxxRKNSt3__18functionIFvxxEEE Referenced from: <F096D2C3-ADC0-3EF4-ACF6-E3075A1DF8EE> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so Expected in: <F444C1C4-7CAA-34AA-AA17-B5ED7975BD31> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib

I have tried both webapp and command line options but receiving the same error.

While trying to look for a solution, I came across this SO post which is sorta related and might help.
https://stackoverflow.com/questions/73370909/m1-mac-returns-oserror-library-not-loaded

Posting it here in case someone found a solution already.

Unable to run app.py

After installing the dependencies and running app.py, it downloads the default model and then:
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
Unexpected key(s) in state_dict: "clap.model.text_branch.embeddings.position_ids", "cond_stage_models.0.cond_stage_models.0.model.text_branch.embeddings.position_ids".

Any idea why this happens and how to resolve?

Possible Google Colab?

Can you run this on a Colab?

And how do you change the length of the piece in huggingface :)

Style Transfer

Thank you for this release! I am curious if you plan to implement style transfer as you did in ldm1? Or if you have any pointers / workflow that I should follow in order to try and achieve this? Thank you!

License

Please change the license @haoheliu. This will open up new possibilities for the world of open source software. This is currently one of the only diffusion-based audio generators. This has the potential to make as large of an impact as Mistral, Llama, or Bark. Please consider open-sourcing this. Thank you!

ERROR: Cannot install transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git) and transformers==4.30.2 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git)
The user requested transformers==4.30.2

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Win11 RXT3060 12GB

I have tried removing the package version but it still can't solve the requirement.

Can't load tokenizer for 'roberta-base'.

How should I handle this situation? I haven't found a solution myself so far. I tried reinstalling by following the instructions again, but the problem still occurs.
OSError: Can't load tokenizer for 'roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.

RAM issue: MPS backend out of memory

Hi,
I am running AudioLDM2 on an M1 MacBook Air with 16 GB of RAM on Ventura 13.5.

On my second test I ran out of memory quickly:
RuntimeError: MPS backend out of memory (MPS allocated: 18.02 GB, other allocations: 107.10 MB, max allowed: 18.13 GB). Tried to allocate 4.50 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Is there any way to prevent this, especially when using the inputtext list option?

Is there a workaround or trick to keep this from crashing on my machine?

Thank you!

Will there be models for AudioLDM2 that run on 12GB VRAM Cards?

return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.29 GiB already allocated; 0 bytes free; 11.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

not working on OSX

Hi,
Would love to try this out, but I keep getting this error when following the exact install instructions on OS X 10.15 (Intel Mac Pro)

File "/Users/dk/opt/miniconda3/envs/audioldm/lib/python3.8/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled

I tried adding this environment variable to my .zshrc file, since that was suggested somewhere, but no difference...
export PYTORCH_ENABLE_MPS_FALLBACK=1
I'd be very thankful for any help to get this running!

checkpoint for audiomae?

Thank you for this amazing work. In your code, I noticed a checkpoint path for the fine-tuned audiomae. Will you share the checkpoint?

LocalEntryNotFoundError

raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

RuntimeError: Pretrained weights not found for model HTSAT-base.

Hello, thank you for sharing the model of your amazing work.

I wish to try out speech generation based on (1) the transcription and (2) the description of the speaker.

However, when I run audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day", I get an error,

RuntimeError: Pretrained weights (/mnt/bn/lqhaoheliu/exps/checkpoints/audioldm/2023_04_07_audioldm_clap_v2_yusong/music_speech_audioset_epoch_15_esc_89.98.pt) not found for model HTSAT-base.

What could I be doing wrong?

Many thanks,

Cannot import name 'AudioLDM2Pipeline' from 'diffusers

having an issue while trying to run the checkpoint from hugging face:

ImportError Traceback (most recent call last)
in <cell line: 3>()
1 import scipy
2 import torch
----> 3 from diffusers import AudioLDM2Pipeline
4 from IPython.display import Audio
5

ImportError: cannot import name 'AudioLDM2Pipeline' from 'diffusers' (/usr/local/lib/python3.10/dist-packages/diffusers/__init__.py)

Windows not yet supported for torch.compile

raise RuntimeError("Windows not yet supported for torch.compile")
RuntimeError: Windows not yet supported for torch.compile

I get this error; does that mean torch.compile cannot support Windows?

thank you,

AttributeError: module 'gradio' has no attribute 'Box'

When following the instructions in the readme (cd into AudioLDM2 and run python3 app.py), it returns the following error:

Traceback (most recent call last):
File "app.py", line 226, in
with gr.Box():
AttributeError: module 'gradio' has no attribute 'Box'

I've looked at the documentation online for gradio and there does not seem to be a "Box" attribute for gradio, so I'm not quite sure what to do other than raising this issue.

Thank you

In-context learning with TTS model

Hi, thanks for the amazing models. I see the TTS models are added to the repo recently. Could you please give an example to provide the audio prompt to the TTS model (audioldm2-speech-gigaspeech) for in-context learning?

Training code

Could the authors please update the training code~~~

How to generate speech condition on not only transcripts and descriptions but also audio clip?

Hi,
Thanks for the open-source code.
I want to generate speech conditioned on transcripts, descriptions, and audio clips by using the audioldm-gigaspeech pre-trained model.
However, I found the provided example only accepts transcripts and descriptions.
Can you also release the example using not only transcripts and descriptions but also audio clips?
or do you have some tips to modify the code to run the speech generation based on transcripts, descriptions, and audio clips?

Thanks in advance.

How to implement text2speech?

Hi, thanks for sharing. I can test it for Text-to-Music, and I want to test Text-to-Speech.

How should I write the prompt to do it? Can you give a template prompt for Text-to-Speech?
