
audioldm2's Introduction

AudioLDM 2

arXiv githubio Hugging Face Spaces

This repo currently supports Text-to-Audio (including Music) and Text-to-Speech generation.


Change Log

  • 2023-08-27: Add two new checkpoints!
    • 🌟 48kHz AudioLDM model: Now we support high-fidelity audio generation! Hugging Face Spaces
    • 16kHz improved AudioLDM model: Trained with more data and optimized model architecture.

TODO

  • Add the text-to-speech checkpoint
  • Open-source the AudioLDM training code.
  • Support the generation of longer audio (> 10s)
  • Optimize the inference speed of the model.
  • Integrate with the Diffusers library (see 🧨 Diffusers)
  • Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PR welcomed, same logic as AudioLDMv1)

Web APP

  1. Prepare running environment
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
  2. Start the web application (powered by Gradio)
python3 app.py
  3. A link will be printed out. Click the link to open the browser and play.

Commandline Usage

Installation

Prepare running environment

# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git

If you plan to play around with text-to-speech generation, please also make sure you have installed espeak. On Linux you can install it with:

sudo apt-get install espeak

Run the model in commandline

  • Generate a sound effect or music based on a text prompt
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
  • Generate sound effects or music based on a list of text prompts (see the example file format below)
audioldm2 -tl batch.lst
  • Generate speech based on (1) the transcription and (2) the description of the speaker
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"

Text-to-Speech uses the audioldm2-speech-gigaspeech checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set --model_name audioldm2-speech-ljspeech.
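
For the -tl option, batch.lst is assumed to be a plain text file with one prompt per line, for example (hypothetical prompts):

A cat meowing repeatedly in a quiet room
Thunder rolling in the distance followed by heavy rain
Upbeat jazz piano with a walking bass line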

Random Seed Matters

Sometimes the model may not perform well (the output sounds weird or is low quality) when running on different hardware. In this case, please adjust the random seed and find the optimal one for your hardware.

audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
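
If you want to compare several seeds side by side, a minimal sketch (the seed values and output paths below are arbitrary) could look like:

# Try a few candidate seeds and save each result to its own folder for comparison
for seed in 0 42 1234 7777; do
  audioldm2 --seed $seed -s "./output/seed_$seed" -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
done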

Pretrained Models

You can choose a model checkpoint by setting model_name:

# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

We have the following checkpoints you can choose from:

  1. audioldm2-full (default): Generates both sound effects and music with the AudioLDM2 architecture.
  2. audioldm_48k: This checkpoint can generate high-fidelity sound effects and music.
  3. audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
  4. audioldm2-full-large-1150k: Larger version of audioldm2-full.
  5. audioldm2-music-665k: Music generation.
  6. audioldm2-speech-gigaspeech (default for TTS): Text-to-Speech, trained on the GigaSpeech dataset.
  7. audioldm2-speech-ljspeech: Text-to-Speech, trained on the LJSpeech dataset.
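
For example, to generate with the high-fidelity audioldm_48k checkpoint:

audioldm2 --model_name "audioldm_48k" -t "Musical constellations twinkling in the night sky, forming a cosmic melody."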

We currently support 3 devices:

  • cpu
  • cuda
  • mps (note that the computation requires about 20 GB of RAM)

Other options

  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
                 [--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
                 [--seed SEED]

  optional arguments:
    -h, --help            show this help message and exit
    -t TEXT, --text TEXT  Text prompt to the model for audio generation
    --transcription TRANSCRIPTION
                        Transcription used for speech synthesis
    -tl TEXT_LIST, --text_list TEXT_LIST
                          A file that contains text prompts to the model for audio generation
    -s SAVE_PATH, --save_path SAVE_PATH
                          The path to save model output
    --model_name {audioldm_48k,audioldm_16k_crossattn_t5,audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
                          The checkpoint you are going to use
    -d DEVICE, --device DEVICE
                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
    -b BATCHSIZE, --batchsize BATCHSIZE
                          Generate how many samples at the same time
    --ddim_steps DDIM_STEPS
                          The sampling step for DDIM
    -dur DURATION, --duration DURATION
                          The duration of the samples
    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                          Guidance scale (Large => better quality and relevance to text; Small => better diversity)
    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                          Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
                          heavier computation
    --seed SEED           Changing this value (any integer number) will lead to a different generation result.
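
As an illustrative sketch (the values below are arbitrary examples, not recommended settings), several of these options can be combined in a single command:

# Generate three candidates of a 10-second clip with the 48 kHz checkpoint on CUDA
audioldm2 -t "A dog barking in the distance while rain falls on a tin roof" \
          -s ./output \
          --model_name audioldm_48k \
          -d cuda \
          -dur 10 \
          --ddim_steps 200 \
          -gs 3.5 \
          -n 3 \
          --seed 42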

Hugging Face 🧨 Diffusers

AudioLDM 2 is available in the Hugging Face 🧨 Diffusers library from v0.21.0 onwards. The official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.

The Diffusers version of the code runs upwards of 3x faster than the native AudioLDM 2 implementation, and supports generating audios of arbitrary length.

To install 🧨 Diffusers and 🤗 Transformers, run:

pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate

You can then load pre-trained weights into the AudioLDM2 pipeline, and generate text-conditional audio outputs by providing a text prompt:

from diffusers import AudioLDM2Pipeline
import torch
import scipy

# Load the pre-trained pipeline in half precision and move it to the GPU
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate a 10-second clip from the text prompt
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# Save the generated waveform as a 16 kHz WAV file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

Tips for obtaining high-quality generations can be found under the AudioLDM 2 docs, including the use of prompt engineering and negative prompting.
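
As a rough sketch of negative prompting with the same pipeline (the prompt, negative prompt, and parameter values below are illustrative only, not recommendations from the docs):

import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Describe the desired audio in the prompt and undesired qualities in the negative prompt
prompt = "The sound of a hammer hitting a wooden surface."
negative_prompt = "Low quality."

# Fix the random generator so runs with and without the negative prompt are comparable
generator = torch.Generator("cuda").manual_seed(0)

audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,  # generate several candidates; the top-ranked one is returned first
    generator=generator,
).audios[0]

scipy.io.wavfile.write("hammer.wav", rate=16000, data=audio)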

Tips for optimising inference speed can be found in the blog post AudioLDM 2, but faster ⚡️.

Cite this work

If you find this tool useful, please consider citing:

@article{audioldm2-2024taslp,
  author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining}, 
  year={2024},
  volume={32},
  pages={2871-2883},
  doi={10.1109/TASLP.2024.3399607}
}
@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023},
  pages={21450-21474}
}

audioldm2's People

Contributors

carlthome, eltociear, haoheliu, kamilake, lingchul, microboym, sanchit-gandhi, shyamsantoki, steve235lab, wing0529


audioldm2's Issues

Training code?

Great work! Are you planning on releasing the model training / fine-tuning code? Or is this not the case, as with AudioLDM1?

Does the quality of the prompt affect the generated results?

Does it happen that I wanted to generate sound effects but get music instead, because the quality of the prompt affects the generated result a lot? Or is there a specific model I should choose for each task? It's in your readme, but it doesn't seem to be an option in the Hugging Face API!

Reproducing results on AudioCaps

Thanks for the great work!

What are the suggested inference hyperparameters / checkpoint to reproduce the results on AudioCaps? I was trying to generate audio via audioldm2 --model_name MODEL -t CAPTION on the AudioCaps test set but was unable to get the same FAD/KL (1.42/0.98) as in Table 1. I tried audioldm2-full and audioldm2-full-large-1150k with the default inference hyperparameters, but their FAD/KL are ~2.7/1.3.

Cuda out of memory?

Could you please tell me how much GPU memory is required for model inference?

Web app just keeps processing forever and not doing anything at all.

Everything works fine after typing "py app.py"

I even got the downloads of the test audios that it creates the first time you run it, and they sound fine.

The issue arises whenever I actually try to create anything of my own. I get the link for the web app. I type it in, it pulls up fine. I enter the description for what I want to create, change my settings, it starts processing, and then.........................

It never finishes, ever. It doesn't even put a load on my GPU. My GPU was maxing out when launching "py app.py" initially, but the web app doesn't affect it. It's almost as if there's no communication between the web app and my GPU at all.

I left it going for 10+ minutes hoping something would happen, but it never does.

Any help would be greatly appreciated.

AudioLDM2 via Pinokio -- CUDA out of memory...

I was excited to install AudioLDM2 via Pinokio, but unfortunately, I'm encountering issues with "CUDA out of memory..." after starting.
PyTorch reserves the entire amount of memory available on the GPU, regardless of its size.

I'm unable to proceed further. Do you have any tips or suggestions to help me resolve this issue?

Thank you in advance!

How can I train this model by myself ?

I want to specialize the model for generating what I want. Please teach me how to train this model.

Additionally, please also explain how to generate audio from images.

Model dropdown does not work

Start app.py
Click "Click to modify detailed configruations"
Last option says "Dropdown" but only 1 model is shown and it does not drop down to select the other models.

Optimisation 3: Torch Compile in blog fails

Optimisation 3: Torch Compile in the blog (), which is mentioned in README.md, produced error messages when I added the following three statements in the Colab notebook:

https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb

pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

torch._dynamo.config.suppress_errors = True

audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

The error messages are:

[2023-10-12 08:32:00,552] torch._dynamo.convert_frame: [ERROR] WON'T CONVERT forward /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py line 664
due to:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 403, in dyn_shape
raise DynamicOutputShapeException(func)
torch._subclasses.fake_tensor.DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1206, in run_node
raise RuntimeError(
RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py", line 71, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information


DynamicOutputShapeException Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in run_node(output_graph, node, args, kwargs, nnmodule)
1198 assert nnmodule is not None
-> 1199 return nnmodule(*args, **kwargs)
1200 elif op == "get_attr":

55 frames

DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)

RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Unsupported Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py in unimplemented(msg)
69 def unimplemented(msg: str):
70 assert msg != os.environ.get("BREAK", False)
---> 71 raise Unsupported(msg)
72
73

Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information

Function to go from raw audio to LOA

Is there any function in the code to go from raw audio to LOA? From the paper, I understand this is done by computing the mel-spec, passing it through the pre-trained MAE, and then doing some pooling. I'm trying to reverse-engineer this from the code but it's not trivial. Any help would be greatly appreciated :)

On Windows 11, running audioldm2 gives error "No module named audioldm2.__main__"

From Windows 11 command line, after creating conda env and pip install of AudioLDM2.git, I try to run:

audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

and get the error:

C:\Users\myusername\AppData\Local\anaconda3\envs\audioldm\python.exe: No module named audioldm2.__main__; 'audioldm2' is a package and cannot be directly executed

Runtime error: espeak not installed on your system

I am on an M1 MacBook Air on Monterey and followed the latest install instructions.
When I run the audioldm2 example command "Musical constellations ...etc", I get this runtime error.

When I try "pip3 install espeak", no matching distribution is found; if I run "pip3 install phonemizer", it says it is already installed....
Any ideas?

Installation and running error on M2 Mac

I am trying to install the repo and get it running on my M2 Mac but get the following error.

OSError: dlopen(/opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at8internal15invoke_parallelExxxRKNSt3__18functionIFvxxEEE Referenced from: <F096D2C3-ADC0-3EF4-ACF6-E3075A1DF8EE> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so Expected in: <F444C1C4-7CAA-34AA-AA17-B5ED7975BD31> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib

I have tried both webapp and command line options but receiving the same error.

While trying to look for a solution, I came across this SO post which is sorta related and might help.
https://stackoverflow.com/questions/73370909/m1-mac-returns-oserror-library-not-loaded

Posting it here in case someone found a solution already.

Unable to run app.py

After installing the dependencies and running app.py, it downloads the default model and then:
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
Unexpected key(s) in state_dict: "clap.model.text_branch.embeddings.position_ids", "cond_stage_models.0.cond_stage_models.0.model.text_branch.embeddings.position_ids".

Any idea why this happens and how to resolve?

Possible Google Colab?

Can you run this on a Colab?

And how do you change the length of the piece in huggingface :)

Style Transfer

Thank you for this release! I am curious if you plan to implement style transfer as you did in ldm1? Or if you have any pointers / workflow that I should follow in order to try and achieve this? Thank you!

License

Please change the license @haoheliu. This will open up new possibilities for the world of open source software. This is currently one of the only diffusion-based audio generators. This has the potential to make as large of an impact as Mistral, Llama, or Bark. Please consider open-sourcing this. Thank you!

ERROR: Cannot install transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git) and transformers==4.30.2 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git)
The user requested transformers==4.30.2

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Win11 RXT3060 12GB

I have tried removing the package version but it still can't solve the requirement.

Can't load tokenizer for 'roberta-base'.

How should I handle this situation? I haven't found a solution myself so far. I tried reinstalling by following the instructions again, but the problem still occurs.
OSError: Can't load tokenizer for 'roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.

RAM issue: MPS backend out of memory

Hi,
I am running AudioLDM2 on an M1 MacBook Air with 16 GB of RAM on Ventura 13.5.

On my second test I ran out of memory quickly:
RuntimeError: MPS backend out of memory (MPS allocated: 18.02 GB, other allocations: 107.10 MB, max allowed: 18.13 GB). Tried to allocate 4.50 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Is there any way to prevent this, especially when using the inputtext list option?

Is there a workaround or trick to keep this from crashing on my machine?

Thank you!

Will there be models for AudioLDM2 that run on 12GB VRAM Cards?

return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.29 GiB already allocated; 0 bytes free; 11.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

not working on OSX

Hi,
Would love to try this out, but I keep getting this error when following the exact install instructions on OS X 10.15 (Intel Mac Pro)

File "/Users/dk/opt/miniconda3/envs/audioldm/lib/python3.8/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled

I tried adding this environment variable to my .zshrc file, since that was suggested somewhere, but no difference...
export PYTORCH_ENABLE_MPS_FALLBACK=1
I'd be very thankful for any help to get this running!

checkpoint for audiomae?

Thank you for this amazing work. In your code, I noticed a checkpoint path for the fine-tuned audiomae. Will you share the checkpoint?

LocalEntryNotFoundError

raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

RuntimeError: Pretrained weights not found for model HTSAT-base.

Hello, thank you for sharing the model of your amazing work.

I wish to try out speech generation based on (1) the transcription and (2) the description of the speaker.

However, when I run audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day", I get an error,

RuntimeError: Pretrained weights (/mnt/bn/lqhaoheliu/exps/checkpoints/audioldm/2023_04_07_audioldm_clap_v2_yusong/music_speech_audioset_epoch_15_esc_89.98.pt) not found for model HTSAT-base.

What could I be doing wrong?

Many thanks,

Cannot import name 'AudioLDM2Pipeline' from 'diffusers

having an issue while trying to run the checkpoint from hugging face:

ImportError Traceback (most recent call last)
in <cell line: 3>()
1 import scipy
2 import torch
----> 3 from diffusers import AudioLDM2Pipeline
4 from IPython.display import Audio
5

ImportError: cannot import name 'AudioLDM2Pipeline' from 'diffusers' (/usr/local/lib/python3.10/dist-packages/diffusers/__init__.py)

Windows not yet supported for torch.compile

raise RuntimeError("Windows not yet supported for torch.compile")
RuntimeError: Windows not yet supported for torch.compile

I get this error; does that mean torch.compile cannot support Windows?

thank you,

AttributeError: module 'gradio' has no attribute 'Box'

When following the instructions in the readme (cd into AudioLDM2 and run python3 app.py), it returns the following error:

Traceback (most recent call last):
File "app.py", line 226, in
with gr.Box():
AttributeError: module 'gradio' has no attribute 'Box'

I've looked at the documentation online for gradio and there does not seem to be a "Box" attribute for gradio, so I'm not quite sure what to do other than raising this issue.

Thank you

In-context learning with TTS model

Hi, thanks for the amazing models. I see the TTS models are added to the repo recently. Could you please give an example to provide the audio prompt to the TTS model (audioldm2-speech-gigaspeech) for in-context learning?

Training code

Could the authors please update the training code~~~

How to generate speech condition on not only transcripts and descriptions but also audio clip?

Hi,
Thanks for the open-source code.
I want to generate speech conditioned on transcripts, descriptions, and audio clips by using the audioldm-gigaspeech pre-trained model.
However, I found the provided example only accepts transcripts and descriptions.
Can you also release the example using not only transcripts and descriptions but also audio clips?
or do you have some tips to modify the code to run the speech generation based on transcripts, descriptions, and audio clips?

Thanks in advance.

How to implement text2speech?

Hi, thanks for sharing. I can test it for Text-to-Music, and I want to test Text-to-Speech.

How should I write the prompt to do it? Can you give a template prompt for Text-to-Speech?
