audioldm2's Issues

not working on OSX

Hi,
Would love to try this out, but I keep getting this error when following the exact install instructions on macOS 10.15 (Intel Mac Pro):

File "/Users/dk/opt/miniconda3/envs/audioldm/lib/python3.8/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I tried adding this environment variable to my .zshrc file, since that was suggested somewhere, but it made no difference:
export PYTORCH_ENABLE_MPS_FALLBACK=1
I'd be very thankful for any help to get this running!
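
The assertion suggests that some code path requests the `cuda` device on a PyTorch build without CUDA support, which is the normal situation on macOS. A minimal sketch of device selection with a CPU fallback follows. `pick_device` is a hypothetical helper, not part of AudioLDM2, and on an Intel Mac running 10.15 the MPS backend is probably unavailable too, so the CPU branch would be the one taken.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the name of the first usable PyTorch device."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the two flags would come from
#   torch.cuda.is_available() and torch.backends.mps.is_available()
print(pick_device(cuda_available=False, mps_available=False))  # cpu
```

Whether AudioLDM2 exposes a way to pass the chosen device through is not stated in this thread, so treat the helper as an illustration of the fallback order only.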

ERROR: Cannot install transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git) and transformers==4.30.2 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers 4.32.0.dev0 (from git+https://github.com/huggingface/transformers.git)
The user requested transformers==4.30.2

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Win11, RTX 3060 12GB

I have tried removing the package version, but pip still cannot resolve the requirements.
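
The pip message says the same package was requested twice, once from Git and once pinned to 4.30.2. A rough sketch for spotting such duplicates in a list of requirement lines is shown below. The helper names are hypothetical, the name parsing is deliberately simplistic, and the `torch` entry is only an illustrative extra.

```python
import re
from collections import defaultdict


def requirement_name(line: str) -> str:
    """Extract a package name from a requirement line (simplified)."""
    line = line.strip()
    if line.startswith("git+"):
        # Use the repository name: ".../transformers.git" -> "transformers"
        repo = line.rstrip("/").split("/")[-1]
        return re.sub(r"\.git(@.*)?$", "", repo).lower()
    # Cut at the first version operator, extras bracket, marker, or space
    return re.split(r"[<>=!~\[\s;]", line, maxsplit=1)[0].lower()


def find_duplicates(lines):
    """Map each package requested more than once to its requirement lines."""
    seen = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            seen[requirement_name(stripped)].append(stripped)
    return {name: specs for name, specs in seen.items() if len(specs) > 1}


requirements = [
    "git+https://github.com/huggingface/transformers.git",
    "transformers==4.30.2",
    "torch",
]
print(find_duplicates(requirements))
```

Keeping only one of the two conflicting entries should let pip resolve that package. Which one AudioLDM2 actually needs is not stated here.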

Will there be models for AudioLDM2 that run on 12GB VRAM Cards?

return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total capacity; 11.29 GiB already allocated; 0 bytes free; 11.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
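
The hint about `max_split_size_mb` only applies when reserved memory is much larger than allocated memory. A small sketch that reads the numbers out of the message above makes the gap explicit. `oom_summary` is a hypothetical helper and assumes the message layout shown in this report.

```python
import re


def oom_summary(message: str) -> dict:
    """Pull the GiB figures out of a CUDA out-of-memory message."""

    def gib(label: str) -> float:
        match = re.search(r"([\d.]+) GiB " + re.escape(label), message)
        if match is None:
            raise ValueError(f"'{label}' not found in message")
        return float(match.group(1))

    total = gib("total capacity")
    allocated = gib("already allocated")
    reserved = gib("reserved in total")
    return {
        "total_gib": total,
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        # Memory held by the caching allocator but not backing live tensors
        "reserved_but_unallocated_gib": round(reserved - allocated, 2),
    }


message = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 12.00 GiB total "
    "capacity; 11.29 GiB already allocated; 0 bytes free; 11.32 GiB reserved "
    "in total by PyTorch)"
)
print(oom_summary(message))
```

Here the gap is only a few hundredths of a GiB, so fragmentation tuning seems unlikely to help. The model appears to need more than the card provides at this precision and batch size.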

Can't load tokenizer for 'roberta-base'.

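How should I handle this situation? I have not found a solution myself so far. I tried reinstalling by following the instructions, but the same problem still occurs.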
OSError: Can't load tokenizer for 'roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.
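
The message names one concrete cause: a local directory called `roberta-base` that shadows the Hub model id. A minimal check for that case could look like the sketch below. `shadowing_directory` is a hypothetical helper. If it finds nothing, a failed download (for example, no network access to the Hub) is the other usual suspect.

```python
from pathlib import Path


def shadowing_directory(model_id: str, working_dir: str = "."):
    """Return the local directory that shadows model_id, or None."""
    candidate = Path(working_dir) / model_id
    return candidate.resolve() if candidate.is_dir() else None


# Run from the directory in which AudioLDM2 is started
print(shadowing_directory("roberta-base"))
```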

Training code?

Great work! Are you planning on releasing the model training / fine-tuning code? Or is this not the case, as with AudioLDM1?

Cannot import name 'AudioLDM2Pipeline' from 'diffusers'

I am having an issue while trying to run the checkpoint from Hugging Face:

ImportError Traceback (most recent call last)
in <cell line: 3>()
1 import scipy
2 import torch
----> 3 from diffusers import AudioLDM2Pipeline
4 from IPython.display import Audio
5

ImportError: cannot import name 'AudioLDM2Pipeline' from 'diffusers' (/usr/local/lib/python3.10/dist-packages/diffusers/__init__.py)
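
An ImportError of this kind usually means the installed diffusers release does not contain the pipeline yet, in which case upgrading the package is the likely remedy. The sketch below reports the installed version and whether the name is exported. `describe_export` is a hypothetical helper, not a diffusers API.

```python
import importlib
from importlib import metadata


def describe_export(package: str, name: str) -> str:
    """Say which version of package is installed and whether it exports name."""
    try:
        module = importlib.import_module(package)
    except ImportError:
        return f"{package} is not installed"
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown"
    status = "exports" if hasattr(module, name) else "does not export"
    return f"{package} {version} {status} {name}"


print(describe_export("diffusers", "AudioLDM2Pipeline"))
```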

Does the prompt affect the generated results?

Since the quality of the prompt affects the generated result a lot, can it happen that I want to generate sound effects but get music instead? Or is there a model for each specific task that I can choose? The README mentions this, but it doesn't seem to be an option in the Hugging Face API!

AudioLDM2 via Pinokio -- CUDA out of memory...

I was excited to install AudioLDM2 via Pinokio, but unfortunately, I'm encountering issues with "CUDA out of memory..." after starting.
PyTorch reserves the entire amount of memory available on the GPU, regardless of its size.

I'm unable to proceed further. Do you have any tips or suggestions to help me resolve this issue?

Thank you in advance!

Optimisation 3: Torch Compile in blog fails

Optimisation 3: Torch Compile in the blog post, which is mentioned in README.md, produces error messages when I add the following three statements to the Colab notebook:

https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb

pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

torch._dynamo.config.suppress_errors = True

audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

The error messages are:

[2023-10-12 08:32:00,552] torch._dynamo.convert_frame: [ERROR] WON'T CONVERT forward /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py line 664
due to:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 403, in dyn_shape
raise DynamicOutputShapeException(func)
torch._subclasses.fake_tensor.DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1206, in run_node
raise RuntimeError(
RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py", line 71, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information


DynamicOutputShapeException Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py in run_node(output_graph, node, args, kwargs, nnmodule)
1198 assert nnmodule is not None
-> 1199 return nnmodule(*args, **kwargs)
1200 elif op == "get_attr":

55 frames

DynamicOutputShapeException: aten.repeat_interleave.Tensor

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)

RuntimeError: Failed running call_module self_down_blocks_1_attentions_2_transformer_blocks_0(*(FakeTensor(FakeTensor(..., device='meta', size=(2, 1024, 256), dtype=torch.float16), cuda:0),), **{'attention_mask': None, 'encoder_hidden_states': FakeTensor(FakeTensor(..., device='meta', size=(2, 18, 1024), dtype=torch.float16), cuda:0), 'encoder_attention_mask': FakeTensor(FakeTensor(..., device='meta', size=(2, 1, 18), dtype=torch.float16), cuda:0), 'timestep': None, 'cross_attention_kwargs': None, 'class_labels': None}):
aten.repeat_interleave.Tensor
(scroll up for backtrace)

During handling of the above exception, another exception occurred:

Unsupported Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/_dynamo/exc.py in unimplemented(msg)
69 def unimplemented(msg: str):
70 assert msg != os.environ.get("BREAK", False)
---> 71 raise Unsupported(msg)
72
73

Unsupported: dynamic shape operator: aten.repeat_interleave.Tensor

from user code:
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/audioldm2/modeling_audioldm2.py", line 1149, in forward
hidden_states = self.attentions[i * num_attention_per_layer + idx](
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformer_2d.py", line 323, in forward
hidden_states = block(

Set torch._dynamo.config.verbose=True for more information

How can I train this model by myself ?

I want to specialize the model for generating what I want. Please teach me how to train this model.

Additionally, please also explain how to generate audio from images.

RuntimeError: Pretrained weights not found for model HTSAT-base.

Hello, thank you for sharing the model of your amazing work.

I wish to try out speech generation based on (1) the transcription and (2) the description of the speaker.

However, when I run audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day", I get an error,

RuntimeError: Pretrained weights (/mnt/bn/lqhaoheliu/exps/checkpoints/audioldm/2023_04_07_audioldm_clap_v2_yusong/music_speech_audioset_epoch_15_esc_89.98.pt) not found for model HTSAT-base.

What could I be doing wrong?

Many thanks,

AttributeError: module 'gradio' has no attribute 'Box'

When following the instructions in the README (cd into AudioLDM2 and run python3 app.py), I get the following error:

Traceback (most recent call last):
File "app.py", line 226, in <module>
with gr.Box():
AttributeError: module 'gradio' has no attribute 'Box'

I've looked at the documentation online for gradio and there does not seem to be a "Box" attribute for gradio, so I'm not quite sure what to do other than raising this issue.

Thank you
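
As far as I know, newer Gradio releases dropped `gr.Box` and offer `gr.Group` as the closest replacement, so the error probably comes from a Gradio version newer than the one app.py was written for. A minimal sketch of a compatibility lookup follows. `resolve_container` is a hypothetical helper, and the stand-in object replaces the real `gradio` module so the example runs without it.

```python
from types import SimpleNamespace


def resolve_container(gr_module):
    """Return gr.Box when it exists, otherwise fall back to gr.Group."""
    for name in ("Box", "Group"):
        container = getattr(gr_module, name, None)
        if container is not None:
            return container
    raise AttributeError("neither Box nor Group is available")


# Stand-in for a newer gradio module that no longer has Box
fake_gradio = SimpleNamespace(Group=object)
print(resolve_container(fake_gradio) is fake_gradio.Group)  # True
```

In app.py this would mean something like `Box = resolve_container(gr)` followed by `with Box():`. Pinning Gradio to the version the repository was developed against is the other option.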

On Windows 11, running audioldm2 gives error "no modules named audioldm2.__main__"

From the Windows 11 command line, after creating the conda env and pip-installing AudioLDM2.git, I try to run:

audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

and get the error:

C:\Users\myusername\AppData\Local\anaconda3\envs\audioldm\python.exe: No module named audioldm2.__main__; 'audioldm2' is a package and cannot be directly executed

Windows not yet supported for torch.compile

raise RuntimeError("Windows not yet supported for torch.compile")
RuntimeError: Windows not yet supported for torch.compile

I get this error. Does this mean torch.compile does not support Windows?

Thank you,
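
Going by the error text, the installed PyTorch build refuses `torch.compile` on Windows. One way to keep a script portable is to skip compilation there and run in eager mode. The sketch below is an assumption-laden illustration: `compile_supported` is a hypothetical helper, and newer PyTorch releases may behave differently.

```python
import platform


def compile_supported(system_name: str = "") -> bool:
    """Return False on Windows, where this PyTorch build rejects torch.compile."""
    system_name = system_name or platform.system()
    return system_name != "Windows"


# Typical use, assuming a diffusers pipeline object named `pipe`:
#   if compile_supported():
#       pipe.unet = torch.compile(pipe.unet)
print(compile_supported("Windows"))  # False
```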

In-context learning with TTS model

Hi, thanks for the amazing models. I see the TTS models were added to the repo recently. Could you please give an example of how to provide an audio prompt to the TTS model (audioldm2-speech-gigaspeech) for in-context learning?

Unable to run app.py

After installing the dependencies and running app.py, it downloads the default model and then fails with:
RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
Unexpected key(s) in state_dict: "clap.model.text_branch.embeddings.position_ids", "cond_stage_models.0.cond_stage_models.0.model.text_branch.embeddings.position_ids".

Any idea why this happens and how to resolve?
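
Both unexpected keys end in `position_ids`. One commonly reported cause is a transformers version that no longer stores this buffer in the model, so a checkpoint saved with an older version carries keys the model does not expect. That is an assumption about this case, not a confirmed diagnosis. A minimal sketch of filtering such keys before loading follows; `drop_keys_with_suffix` is a hypothetical helper and the dictionary values are placeholders.

```python
def drop_keys_with_suffix(state_dict: dict, suffix: str = "position_ids") -> dict:
    """Return a copy of state_dict without keys ending in suffix."""
    return {key: value for key, value in state_dict.items()
            if not key.endswith(suffix)}


# Keys taken from the error message; the other entry is illustrative
checkpoint = {
    "clap.model.text_branch.embeddings.position_ids": "placeholder",
    "cond_stage_models.0.cond_stage_models.0.model.text_branch.embeddings.position_ids": "placeholder",
    "some.other.weight": "placeholder",
}
print(list(drop_keys_with_suffix(checkpoint)))
```

Passing `strict=False` to `load_state_dict` is the blunter alternative, since it ignores every mismatched key rather than only these two. Installing the transformers version the repository pins would avoid the mismatch altogether.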

Runtime error: espeak not installed on your system

I am on an M1 MacBook Air on Monterey and followed the latest install instructions.
When I run the audioldm2 example command ("Musical constellations ...etc"), I get this runtime error.

When I try "pip3 install espeak", no matching distribution is found; if I run "pip3 install phonemizer", it says it is already installed.
Any ideas?
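
espeak is a system program rather than a Python package, which would explain why pip finds no matching distribution. The phonemizer package only calls the binary. A minimal sketch for checking whether the binary is on the PATH is shown below. `find_espeak` is a hypothetical helper, and the two executable names are the ones I would expect to find.

```python
import shutil


def find_espeak():
    """Return the path of an espeak executable on PATH, or None."""
    for name in ("espeak-ng", "espeak"):
        path = shutil.which(name)
        if path:
            return path
    return None


print(find_espeak())
```

If this prints None, installing espeak through a system package manager such as Homebrew is probably the missing step.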

Training code

Could the authors please update the training code?

Reproducing results on AudioCaps

Thanks for the great work!

What are the suggested inference hyperparameters / checkpoint to reproduce the results on AudioCaps? I was trying to generate audio via audioldm2 --model_name MODEL -t CAPTION on the AudioCaps test set but was unable to get the same FAD/KL (1.42/0.98) as in Table 1. I tried audioldm2-full and audioldm2-full-large-1150k with the default inference hyperparameters, but their FAD/KL are ~2.7/1.3.

Possible Google Colab?

Can you run this on a Colab?

And how do you change the length of the piece in huggingface :)

Installation and running error on M2 Mac

I am trying to install the repo and get it running on my M2 Mac but get the following error.

OSError: dlopen(/opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at8internal15invoke_parallelExxxRKNSt3__18functionIFvxxEEE Referenced from: <F096D2C3-ADC0-3EF4-ACF6-E3075A1DF8EE> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torchaudio/lib/libtorchaudio.so Expected in: <F444C1C4-7CAA-34AA-AA17-B5ED7975BD31> /opt/homebrew/anaconda3/lib/python3.11/site-packages/torch/lib/libtorch_cpu.dylib

I have tried both the web app and the command-line options but get the same error.

While trying to look for a solution, I came across this SO post which is sorta related and might help.
https://stackoverflow.com/questions/73370909/m1-mac-returns-oserror-library-not-loaded

Posting it here in case someone found a solution already.
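
A symbol that libtorchaudio.so expects but libtorch_cpu.dylib does not provide often points to torch and torchaudio builds that do not belong together. A minimal sketch for listing the installed versions side by side follows. `installed_versions` is a hypothetical helper, and whether a given pair matches has to be checked against the PyTorch compatibility information.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["torch", "torchaudio"]))
```

Reinstalling both packages together in a fresh environment is the usual way to get a matching pair.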

RAM issue: MPS backend out of memory

Hi,
I am running AudioLDM2 on an M1 MacBook Air with 16 GB of RAM on Ventura 13.5.

On my second test I ran out of memory quickly:
RuntimeError: MPS backend out of memory (MPS allocated: 18.02 GB, other allocations: 107.10 MB, max allowed: 18.13 GB). Tried to allocate 4.50 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Is there any way to prevent this, especially when using the inputtext list option?

Is there a workaround or trick to keep this from crashing on my machine?

Thank you!

Cuda out of memory?

Could you please tell me how much GPU memory is required for model inference?

checkpoint for audiomae?

Thank you for this amazing work. In your code, I noticed a checkpoint path for the fine-tuned audiomae. Will you share the checkpoint?

License

Please change the license @haoheliu. This will open up new possibilities for the world of open source software. This is currently one of the only diffusion-based audio generators. This has the potential to make as large an impact as Mistral, Llama, or Bark. Please consider open-sourcing this. Thank you!

Style Transfer

Thank you for this release! I am curious if you plan to implement style transfer as you did in ldm1? Or if you have any pointers / workflow that I should follow in order to try and achieve this? Thank you!

Web app just keeps processing forever and not doing anything at all.

Everything works fine after typing "py app.py"

I even got the downloads of the test audios that it creates the first time you run it, and they sound fine.

The issue arises whenever I actually try to create anything of my own. I get the link for the web app. I type it in, it pulls up fine. I enter the description for what I want to create, change my settings, it starts processing, and then.........................

It never finishes, ever. It doesn't even put a load on my GPU. My GPU was maxing out when launching "py app.py" initially, but the web app doesn't affect it. It's almost as if there's no communication between the web app and my GPU at all.

I left it going for 10+ minutes hoping something would happen, but it never does.

Any help would be greatly appreciated.

How to generate speech conditioned not only on transcripts and descriptions but also on an audio clip?

Hi,
Thanks for the open-source code.
I want to generate speech conditioned on transcripts, descriptions, and audio clips by using the audioldm-gigaspeech pre-trained model.
However, I found that the provided example only accepts transcripts and descriptions.
Can you also release an example that uses audio clips in addition to transcripts and descriptions?
Or do you have some tips on modifying the code to run speech generation based on transcripts, descriptions, and audio clips?

Thanks in advance.

Function to go from raw audio to LOA

Is there any function in the code to go from raw audio to LOA? From the paper, I understand this is done by computing the mel-spec, passing it through the pre-trained MAE, and then doing some pooling. I'm trying to reverse-engineer this from the code but it's not trivial. Any help would be greatly appreciated :)

LocalEntryNotFoundError

raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

Model dropdown does not work

Start app.py
Click "Click to modify detailed configurations"
Last option says "Dropdown" but only 1 model is shown and it does not drop down to select the other models.

How to implement text-to-speech?

Hi, thanks for sharing. I was able to test Text-to-Music, and now I want to test Text-to-Speech.

How should I write the prompt for it? Can you provide a template prompt for Text-to-Speech?
