haoheliu / audioldm2
Text-to-Audio/Music Generation
License: Other
Hi,
Would love to try this out, but I keep getting this error when following the exact install instructions on OS X 10.15 (Intel Mac Pro):
File "/Users/dk/opt/miniconda3/envs/audioldm/lib/python3.8/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled
I tried adding this environment variable to my .zshrc file, since that was suggested somewhere, but it made no difference:
export PYTORCH_ENABLE_MPS_FALLBACK=1
I'd be very thankful for any help to get this running!
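For what it's worth, the error means the code is requesting CUDA on a machine that has none. A minimal device-selection sketch (my assumption about a fix, not the repo's actual code) is to pick the best available backend before any .to("cuda") call:

    import torch

    # Pick the best available backend instead of assuming CUDA.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = model.to(device)  # `model` stands in for whatever module the repo builds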
The conflict is caused by:
The user requested transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git)
The user requested transformers==4.30.2
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
Windows 11, RTX 3060 12 GB
I have tried removing the package version, but that still doesn't resolve the requirement conflict.
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.29 GiB already allocated; 0 bytes free; 11.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
How should I handle this situation? I haven't found a solution myself so far; I tried reinstalling by following the instructions again, but the same problem occurs.
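A minimal sketch of the allocator hint the error message itself suggests (the 128 MiB split size is an illustrative value, not a recommendation from the repo): set PYTORCH_CUDA_ALLOC_CONF before torch touches CUDA, e.g. at the top of the entry script:

    import os

    # Cap the allocator's split size to reduce fragmentation, per the hint
    # in the OOM message; must run before CUDA is initialized.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch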
OSError: Can't load tokenizer for 'roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.
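One way to narrow this down (an assumption about the cause, not a confirmed fix): check whether the tokenizer can be fetched at all outside the repo, which separates network/cache problems from a local directory named roberta-base shadowing the model id:

    from transformers import RobertaTokenizer

    # If this succeeds, the hub download works and the problem is likely a
    # local directory named "roberta-base" shadowing the model id.
    tok = RobertaTokenizer.from_pretrained("roberta-base")
    print(tok)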
Great work! Are you planning on releasing the model training / fine-tuning code? Or will it stay unreleased, as with AudioLDM1?
Having an issue while trying to run the checkpoint from Hugging Face:
ImportError Traceback (most recent call last)
in <cell line: 3>()
1 import scipy
2 import torch
----> 3 from diffusers import AudioLDM2Pipeline
4 from IPython.display import Audio
5
ImportError: cannot import name 'AudioLDM2Pipeline' from 'diffusers' (/usr/local/lib/python3.10/dist-packages/diffusers/__init__.py)
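A likely cause (my assumption, not confirmed in the thread): the installed diffusers predates the AudioLDM2 pipeline, so upgrading diffusers and retrying the import is the first thing to check. A minimal sketch using the base checkpoint id from the Hub:

    # First: pip install -U diffusers transformers accelerate scipy
    import torch
    from diffusers import AudioLDM2Pipeline
    from IPython.display import Audio

    # "cvssp/audioldm2" is the base AudioLDM2 checkpoint on the Hub.
    pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    audio = pipe("a dog barking in the distance").audios[0]
    Audio(audio, rate=16000)  # AudioLDM2 generates 16 kHz audio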
Given how much the prompt affects the generated result: sometimes I want to generate sound effects but end up generating music instead. Is this just prompt quality, or is there a task-specific model I should choose? The README mentions task-specific models, but this doesn't seem to be an option in the Hugging Face API!
I was excited to install AudioLDM2 via Pinokio, but unfortunately I'm encountering a "CUDA out of memory" error after starting.
PyTorch reserves the entire amount of memory available on the GPU, regardless of its size.
I'm unable to proceed further. Do you have any tips or suggestions to help me resolve this issue?
Thank you in advance!
Optimisation 3 (Torch Compile) from the blog post mentioned in README.md produced error messages when I added the following three statements to the Colab notebook:
https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
torch._dynamo.config.suppress_errors = True
audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]
The error messages are:
[2023-10-12 08:32:00,552] torch._dynamo.convert_frame: [ERROR] WON'T CONVERT forward /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py line 664
due to:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 403, in dyn_shape
raise DynamicOutputShapeException(func)
torch._subclasses.fake_tensor.DynamicOutputShapeException: aten.repeat_interleave.Tensor
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1206, in run_node
raise RuntimeError(
RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py", line 71, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor
from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(
Set torch._dynamo.config.verbose=True for more information
(The notebook then re-displays the same exception chain across 55 frames: DynamicOutputShapeException: aten.repeat_interleave.Tensor → RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0 … → Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor, ending with the same user-code frames in modeling_audioldm2.py line 1149 and transformer_2d.py line 323.)
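For what it's worth, the failure is dynamo refusing to trace aten.repeat_interleave, whose output shape is data-dependent; with fullgraph=True any such graph break becomes a hard error. A hedged workaround sketch (my assumption, not an official fix) is to allow graph breaks so the unsupported op falls back to eager:

    # Allow partial compilation instead of requiring one full graph;
    # the dynamic-shape op then runs in eager mode.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=False)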
I want to specialize the model to generate the kind of audio I need. Could you explain how to train this model?
Could you also explain how to generate audio from images?
I downloaded the models separately from Hugging Face; where should I put them?
Hello, thank you for sharing the model of your amazing work.
I wish to try out speech generation based on (1) the transcription and (2) the description of the speaker.
However, when I run audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day", I get an error:
RuntimeError: Pretrained weights (/mnt/bn/lqhaoheliu/exps/checkpoints/audioldm/2023_04_07_audioldm_clap_v2_yusong/music_speech_audioset_epoch_15_esc_89.98.pt) not found for model HTSAT-base.
What could I be doing wrong?
Many thanks,
The most surprising part of AudioLDM2 was the results of converting images to audio.
Will this be a future release?
When following the instructions in the readme (cd into AudioLDM2 and run python3 app.py), it returns the following error:
Traceback (most recent call last):
File "app.py", line 226, in
with gr.Box():
AttributeError: module 'gradio' has no attribute 'Box'
I've looked at the gradio documentation online, and there does not seem to be a "Box" attribute in gradio, so I'm not quite sure what to do other than raising this issue.
Thank you
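For context: gr.Box existed in Gradio 3.x and was removed in 4.x, so if app.py targets the old API, pinning an older release is one workaround (my assumption about the version the app was written against):

    pip install "gradio<4"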
From Windows 11 command line, after creating conda env and pip install of AudioLDM2.git, I try to run:
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody.
and get the error:
C:\Users\myusername\AppData\Local\anaconda3\envs\audioldm\python.exe: No module named audioldm2.__main__; 'audioldm2' is a package and cannot be directly executed
raise RuntimeError("Windows not yet supported for torch.compile")
RuntimeError: Windows not yet supported for torch.compile
I get this error; does that mean torch.compile cannot support Windows?
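The error message is explicit: at the time of this report, torch.compile was not supported on Windows. A minimal guard sketch (assuming pipe is the diffusers pipeline from the earlier examples):

    import sys
    import torch

    # Only compile where torch.compile is supported; run eager elsewhere.
    if sys.platform != "win32":
        pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")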
thank you,
Hi, thanks for the amazing models. I see the TTS models are added to the repo recently. Could you please give an example to provide the audio prompt to the TTS model (audioldm2-speech-gigaspeech) for in-context learning?
Having issues even after manually going through and installing dependencies.
After installing the dependencies and running app.py,
it downloads the default model, and then:
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
Unexpected key(s) in state_dict: "clap.model.text_branch.embeddings.position_ids", "cond_stage_models.0.cond_stage_models.0.model.text_branch.embeddings.position_ids".
Any idea why this happens and how to resolve?
I am on a M1 macbook air on Monterey and followed the latest install instructions.
when I run the audioldm2 example command ("Musical constellations ...", etc.) I get this runtime error.
When I try "pip3 install espeak" no matching distribution is found, if I run "pip3 install phonemizer" it says it is already installed....
Any ideas?
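Two separate things may be going on here (my reading, not a confirmed diagnosis). First, espeak is a system library rather than a pip package (on macOS it is typically installed with Homebrew, e.g. brew install espeak), which is why pip finds no distribution; phonemizer needs it present on the system. Second, the "Unexpected key(s) ... position_ids" error commonly appears when the installed transformers version no longer registers position_ids as a persistent buffer while the checkpoint still contains it. A hedged workaround sketch, assuming access to the state_dict and model objects in the repo's loading code:

    # Drop the stale buffer keys that newer transformers versions no longer
    # register, then load non-strictly. `state_dict` and `model` are
    # placeholders for the repo's own loading code.
    state_dict = {k: v for k, v in state_dict.items() if not k.endswith("position_ids")}
    model.load_state_dict(state_dict, strict=False)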
Could the authors please release the training code?
Following the setup instructions and launching the webapp, I got this when executing the command:
python3 app.py
after having executed:
conda create -n audioldm python=3.8
conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
app.py fails with: Expecting value: line 1 column 1 (char 0)
I don't want to have to download everything again each time I want to use AudioLDM2 locally.
...
Thanks for the great work!
What are the suggested inference hyperparameters / checkpoint to reproduce the results on AudioCaps? I was trying to generate audio via audioldm2 --model_name MODEL -t CAPTION on the AudioCaps test set but was unable to get the same FAD/KL (1.42/0.98) as in Table 1. I tried audioldm2-full and audioldm2-full-large-1150k with the default inference hyperparameters, but their FAD/KL are ~2.7/1.3.
Can you run this on a Colab?
And how do you change the length of the piece on Hugging Face? :)
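In the diffusers pipeline (as opposed to the hosted Hugging Face widget), the clip length can be set explicitly; a minimal sketch assuming the AudioLDM2Pipeline from the import example above:

    # audio_length_in_s controls the duration of the generated clip.
    audio = pipe("a gentle piano melody", audio_length_in_s=10.0).audios[0]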
AssertionError: Torch not compiled with CUDA enabled
Any plans to support Mac M1?
I am trying to install the repo and get it running on my M2 Mac, but I get the following error:
OSError: dlopen(/opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at8internal15invoke_parallelExxxRKNSt3__18functionIFvxxEEE
Referenced from: <F096D2C3-ADC0-3EF4-ACF6-E3075A1DF8EE> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so
Expected in: <F444C1C4-7CAA-34AA-AA17-B5ED7975BD31> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib
I have tried both webapp and command line options but receiving the same error.
While trying to look for a solution, I came across this StackOverflow post, which is sort of related and might help:
https://stackoverflow.com/questions/73370909/m1-mac-returns-oserror-library-not-loaded
Posting it here in case someone found a solution already.
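For what it's worth, a missing-symbol dlopen error between libtorchaudio.so and libtorch_cpu.dylib usually means torch and torchaudio were built against different versions; force-reinstalling a matched pair is one way to test that theory (the versions below are illustrative, not a requirement of this repo):

    pip install --force-reinstall torch==2.0.1 torchaudio==2.0.2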
Hi,
I am running AudioLDM2 on an M1 MacBook Air with 16 GB of RAM on Ventura 13.5.
On my second test I ran out of memory quickly:
RuntimeError: MPS backend out of memory (MPS allocated: 18.02 GB, other allocations: 107.10 MB, max allowed: 18.13 GB). Tried to allocate 4.50 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
Is there any way to prevent this, especially when using the inputtext list option?
Is there a workaround or trick to keep this from crashing on my machine?
Thank you!
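The error message itself suggests a workaround; a minimal sketch (note the caveat from the message: disabling the cap may destabilize the system under real memory pressure):

    import os

    # Lift the MPS allocator's upper memory limit; must be set before torch
    # initializes the MPS backend.
    os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

    import torch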
Could you please tell me how much GPU memory is required for model inference?
Thank you for this amazing work. In your code, I noticed a checkpoint path for the fine-tuned audiomae. Will you share the checkpoint?
Please change the license @haoheliu. This would open up new possibilities for the world of open source software. This is currently one of the only diffusion-based audio generators. It has the potential to make as large an impact as Mistral, Llama, or Bark. Please consider open-sourcing this. Thank you!
Thank you for this release! I am curious whether you plan to implement style transfer as you did in AudioLDM 1? Or do you have any pointers / a workflow I should follow to try and achieve this? Thank you!
Everything works fine after typing "py app.py"
I even got the downloads of the test audios that it creates the first time you run it, and they sound fine.
The issue arises whenever I actually try to create anything of my own. I get the link for the web app, type it in, and it loads fine. I enter the description of what I want to create, change my settings, it starts processing, and then it never finishes. It doesn't even put a load on my GPU: the GPU was maxing out when launching "py app.py" initially, but the web app doesn't affect it at all, as if there's no communication between the web app and the GPU. I left it running for over 10 minutes hoping something would happen, but it never does.
Any help would be greatly appreciated.
Hi,
Thanks for the open-source code.
I want to generate speech conditioned on transcripts, descriptions, and audio clips by using the audioldm2-speech-gigaspeech pre-trained model.
However, I found the provided example only accepts transcripts and descriptions.
Could you also release an example that uses not only transcripts and descriptions but also audio clips?
Or do you have any tips on modifying the code to run speech generation conditioned on transcripts, descriptions, and audio clips?
Thanks in advance.
Is there any function in the code to go from raw audio to LOA? From the paper, I understand this is done by computing the mel-spec, passing it through the pre-trained MAE, and then doing some pooling. I'm trying to reverse-engineer this from the code but it's not trivial. Any help would be greatly appreciated :)
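For reference, a heavily hedged sketch of the paper's description (the names below are my assumptions, not the repo's actual API): compute the mel spectrogram, encode it with the pre-trained AudioMAE, then pool groups of patch embeddings to get the LOA sequence:

    import torch
    import torch.nn.functional as F

    # Hypothetical helpers: `mel_transform` maps a waveform to a mel
    # spectrogram and `audiomae_encoder` is the pre-trained AudioMAE.
    def audio_to_loa(waveform, mel_transform, audiomae_encoder, pool_size=4):
        mel = mel_transform(waveform)                 # (n_mels, time)
        patches = audiomae_encoder(mel.unsqueeze(0))  # (1, num_patches, dim)
        # Average-pool groups of patch embeddings to shorten the sequence,
        # yielding the "language of audio" (LOA) tokens from the paper.
        loa = F.avg_pool1d(patches.transpose(1, 2), kernel_size=pool_size)
        return loa.transpose(1, 2)                    # (1, num_patches // pool_size, dim)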
Does it support prompt engineering? I couldn't find this in the --help of the audioldm2 library.
https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2
It doesn't seem to support TTS; are there plans to add your two TTS checkpoints?
thank you
Hi, is there a page or resource that collects generated audio samples along with their prompts in a centralised way?
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
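One hedged workaround for flaky connections (the model id is the base checkpoint used elsewhere in this thread, an assumption on my part): pre-download the files once while online, so subsequent runs read from the local cache:

    from huggingface_hub import snapshot_download

    # Downloads (or resumes) all files for the repo into the local HF cache.
    snapshot_download("cvssp/audioldm2")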
I saw there are example outputs here, but there is no info on how to run Text-to-Speech in the documentation, e.g.:
In a man's voice, say "hello world"
In a woman's voice, say "hello world"
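For reference, another issue in this thread shows the CLI pattern for speech generation (whether it runs depends on the TTS checkpoint being set up correctly):

    audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"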
Hi
I'm wondering whether AudioMAE is frozen during training or fine-tuned jointly?
Start app.py.
Click "Click to modify detailed configruations".
The last option says "Dropdown", but only one model is shown, and it does not drop down to let me select the other models.
Hi, thanks for your sharing. I can test Text-to-Music, and now I want to test Text-to-Speech.
How should I write the prompt to do that? Could you give a template prompt for Text-to-Speech?