declare-lab / tango

A family of diffusion models for text-to-audio generation.

Home Page: https://tango2-web.github.io/

License: Other

Python 81.67% Shell 0.01% Makefile 0.03% Dockerfile 0.07% MDX 6.89% Jupyter Notebook 11.33%
audio-generation diffusion diffusion-models language-models large-language-models text-to-audio

tango's Introduction

Tango: LLM-guided Diffusion-based Text-to-Audio Generation and DPO-based Alignment

Tango 2

Tango 2 Paper | Tango 2 Model | Tango 2 Demo | Audio-Alpaca | Tango 2 Website
Tango Paper | Tango Model | Tango Demo | Tango Website

🎡 πŸ”₯ πŸŽ‰ πŸŽ‰ We are releasing Tango 2, built upon Tango, for text-to-audio generation. Tango 2 was initialized with the Tango-full-ft checkpoint and underwent alignment training using DPO on Audio-alpaca, a pairwise text-to-audio preference dataset. Download the model, access the demo, and find the trainer in the tango2 directory. 🎢

Quickstart on Google Colab

Colab Info
Open In Colab Tango_2_Google_Colab_demo.ipynb

Tango Model Family

Model Name Model Path
Tango https://huggingface.co/declare-lab/tango
Tango-Full-FT-Audiocaps https://huggingface.co/declare-lab/tango-full-ft-audiocaps
Tango-Full-FT-Audio-Music-Caps https://huggingface.co/declare-lab/tango-full-ft-audio-music-caps
Mustango https://huggingface.co/declare-lab/mustango
Tango-Full https://huggingface.co/declare-lab/tango-full
Tango-2 https://huggingface.co/declare-lab/tango2
Tango-2-full https://huggingface.co/declare-lab/tango2-full

Description

TANGO is a latent diffusion model (LDM) for text-to-audio (TTA) generation. TANGO can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We perform comparably to current state-of-the-art TTA models across both objective and subjective metrics, despite training the LDM on a 63 times smaller dataset. We release our model, training and inference code, and pre-trained checkpoints for the research community.

🎡 πŸ”₯ We are releasing Tango 2, built upon Tango, for text-to-audio generation. Tango 2 was initialized with the Tango-full-ft checkpoint and underwent alignment training using DPO on Audio-alpaca, a pairwise text-to-audio preference dataset. 🎢

🎡 πŸ”₯ We are also making Audio-alpaca available. Audio-alpaca is a pairwise preference dataset containing about 15k (prompt, audio_w, audio_l) triplets where, given a textual prompt, audio_w is the preferred generated audio and audio_l is the undesirable audio. Download Audio-alpaca. Tango 2 was trained on Audio-alpaca.

Quickstart Guide

Download the TANGO model and generate audio from a text prompt:

import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango2")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
CheerClap.webm

The model will be automatically downloaded and saved in cache. Subsequent runs will load the model directly from cache.
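If you prefer the weights somewhere other than the default Hugging Face cache, one hedged option is to point the cache elsewhere via the standard HF_HOME environment variable before constructing Tango; whether this takes effect depends on how your version of Tango fetches the weights internally, so treat it as a sketch rather than a guaranteed knob.

import os

# Hedged sketch: redirect the Hugging Face cache before loading the model.
# HF_HOME is a standard Hugging Face environment variable; verify that your
# version of Tango downloads weights through the Hub for this to take effect.
os.environ["HF_HOME"] = "/path/to/tango/models"

from tango import Tango
tango = Tango("declare-lab/tango2")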

The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for generating better-quality audio, at the cost of increased run-time.

prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
Thunder.webm

Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:

prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)

This will generate two samples for each of the three text prompts.
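To listen to or save the individual samples, one option is sketched below. It assumes that with samples=2 the return value is a list with one entry per prompt, each entry holding that prompt's generated waveforms; check the return value of generate_for_batch in tango.py for your version and adjust the indexing if it differs.

import soundfile as sf

# Hedged sketch: save every generated sample to a wav file, assuming `audios`
# holds one sub-list of waveforms per prompt (samples=2 above).
for prompt, prompt_audios in zip(prompts, audios):
    for j, audio in enumerate(prompt_audios):
        sf.write(f"{prompt}_sample_{j}.wav", audio, samplerate=16000)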

More generated samples are shown here.

Prerequisites

Our code is built on PyTorch 1.13.1+cu117. We pin torch==1.13.1 in the requirements file, but you might need to install a CUDA-specific build of torch depending on your GPU.

Install requirements.txt.

git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt

You might also need to install libsndfile1 for soundfile to work properly on Linux:

(sudo) apt-get install libsndfile1
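After installation, a quick sanity check (generic PyTorch, nothing Tango-specific) confirms that the installed torch build can see your GPU:

import torch

# Quick sanity check: verify the torch build and CUDA availability.
print(torch.__version__)          # e.g. 1.13.1+cu117
print(torch.cuda.is_available())  # should print True for GPU training/inference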

Datasets

Follow the instructions in the AudioCaps repository for downloading the data. The audio locations and corresponding captions are provided in our data directory. The *.json files are used for training and evaluation. Once you have downloaded your version of the data, you should be able to map it, using the file ids, to the file locations provided in our data/*.json files.
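As an illustration of that mapping step, here is a minimal sketch. It assumes the data/*.json files are JSON-lines records with dataset, location, and captions fields (as suggested by the --audio_column location and --text_column captions training arguments) and that your downloaded clips are named after their file ids; adapt the id extraction to however your copy of the data is laid out.

import json
from pathlib import Path

# Hedged sketch: rewrite the `location` field of the data/*.json entries so that
# it points at your local AudioCaps download. Assumes JSON-lines records with
# `dataset`, `location`, and `captions` keys, and clips named `<file_id>.wav`.
local_audio_dir = Path("/path/to/your/audiocaps/wavs")

with open("data/train_audiocaps.json") as fin, open("data/train_audiocaps_local.json", "w") as fout:
    for line in fin:
        record = json.loads(line)
        file_id = Path(record["location"]).stem   # adjust to your naming scheme
        record["location"] = str(local_audio_dir / f"{file_id}.wav")
        fout.write(json.dumps(record) + "\n")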

Note that we cannot distribute the data because of copyright issues.

🎡 πŸ”₯ We are also making Audio-alpaca available. Audio-alpaca is a pairwise preference dataset containing about 15k (prompt, audio_w, audio_l) triplets where, given a textual prompt, audio_w is the preferred generated audio and audio_l is the undesirable audio. Download Audio-alpaca. Tango 2 was trained on Audio-alpaca.
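If you just want to inspect Audio-alpaca before training, a minimal sketch using the datasets library is shown below. The repository id declare-lab/audio-alpaca and the column layout are assumptions here, so check the dataset card on the Hugging Face Hub for the actual schema.

from datasets import load_dataset

# Minimal sketch — the repo id and the exact column names are assumptions;
# consult the Audio-alpaca dataset card for the real schema.
audio_alpaca = load_dataset("declare-lab/audio-alpaca", split="train")
print(audio_alpaca)            # shows the available columns and number of rows
print(audio_alpaca[0].keys())  # expect a prompt plus preferred/rejected audio fields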

How to train?

We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config from the terminal and set up your run configuration by answering the questions.

You can now train TANGO on the AudioCaps dataset using:

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

The --augment argument uses augmented data for training, as reported in our paper. We recommend training for at least 40 epochs, which is the default in train.py.

To start training from our released checkpoint, use the --hf_model argument.

accelerate launch train.py \
--hf_model "declare-lab/tango" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

Check train.py and train.sh for the full list of arguments and how to use them.

The training script should automatically download the AudioLDM weights from here. However, if the download is slow or you face any other issues, you can: i) download the audioldm-s-full file from here, ii) rename it to audioldm-s-full.ckpt, and iii) place it in the /home/user/.cache/audioldm/ directory.
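For the manual route, steps i)–iii) amount to something like the sketch below, where <AUDIOLDM_S_FULL_URL> is a placeholder for the download link referenced above rather than a real URL:

from pathlib import Path
import urllib.request

# Hedged sketch of the manual fallback: fetch the audioldm-s-full checkpoint,
# name it audioldm-s-full.ckpt, and place it in ~/.cache/audioldm/.
# <AUDIOLDM_S_FULL_URL> is a placeholder — use the link referenced in the text above.
url = "<AUDIOLDM_S_FULL_URL>"
cache_dir = Path.home() / ".cache" / "audioldm"
cache_dir.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, cache_dir / "audioldm-s-full.ckpt")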

To train TANGO 2 on the Audio-alpaca dataset starting from a TANGO checkpoint, use the command below. The training script will download the audio_alpaca wav files and save them in {PATH_TO_DOWNLOAD_WAV_FILE}/audio_alpaca. The default location is ~/.cache/huggingface/datasets.

accelerate launch  tango2/tango2-train.py --hf_model "declare-lab/tango-full-ft-audiocaps" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder  \
--learning_rate=9.6e-7 \
--num_train_epochs=5  \
--num_warmup_steps=200 \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4  \
--gradient_accumulation_steps=4 \
--beta_dpo=2000  \
--sft_first_epochs=1 \
--dataset_dir={PATH_TO_DOWNLOAD_WAV_FILE}

How to make inferences?

From your trained checkpoints

Checkpoints from training will be saved in the saved/*/ directory.

To perform audio generation and objective evaluation on the AudioCaps test set from your trained checkpoint:

CUDA_VISIBLE_DEVICES=0 python inference.py \
--original_args="saved/*/summary.jsonl" \
--model="saved/*/best/pytorch_model_2.bin" \

Check inference.py and inference.sh for the full list of arguments and how to use them.

To perform audio generation and objective evaluation on the AudioCaps test set for TANGO 2:

CUDA_VISIBLE_DEVICES=0 python tango2/inference.py \
--original_args="saved/*/summary.jsonl" \
--model="saved/*/best/pytorch_model_2.bin" \

Note that the TANGO 2 inference.py script is different from the TANGO one.

From our released checkpoints in Hugging Face Hub

To perform audio generation and objective evaluation on the AudioCaps test set from our Hugging Face checkpoints:

python inference_hf.py --checkpoint="declare-lab/tango"

Note

We use functionalities from audioldm_eval for objective evaluation in inference.py. It requires the gold reference audio files and the generated audio files to have the same names. You need to create the directory data/audiocaps_test_references/subset and keep the reference audio files there. The files should be named output_0.wav, output_1.wav, and so on, where the indices correspond to the line indices in data/test_audiocaps_subset.json.

We use the term subset because some data instances originally released in AudioCaps have since been removed from YouTube and are no longer available. We thus evaluated our models on all the instances that were available as of 8th April 2023.
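One way to lay out that reference directory is sketched below; it assumes data/test_audiocaps_subset.json is JSON-lines with a location field pointing at your local copy of each reference clip, so adjust the field name if your files differ.

import json
import shutil
from pathlib import Path

# Hedged sketch: copy the gold reference audio into
# data/audiocaps_test_references/subset as output_0.wav, output_1.wav, ...,
# where the index matches the line index in data/test_audiocaps_subset.json.
ref_dir = Path("data/audiocaps_test_references/subset")
ref_dir.mkdir(parents=True, exist_ok=True)

with open("data/test_audiocaps_subset.json") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        shutil.copy(record["location"], ref_dir / f"output_{i}.wav")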

We use wandb to log training and inference results.

Experimental Results

Tango

Model Datasets Text #Params FD ↓ KL ↓ FAD ↓ OVL ↑ REL ↑
Ground truth βˆ’ βˆ’ βˆ’ βˆ’ βˆ’ βˆ’ 91.61 86.78
DiffSound AS+AC βœ“ 400M 47.68 2.52 7.75 βˆ’ βˆ’
AudioGen AS+AC+8 others βœ— 285M βˆ’ 2.09 3.13 βˆ’ βˆ’
AudioLDM-S AC βœ— 181M 29.48 1.97 2.43 βˆ’ βˆ’
AudioLDM-L AC βœ— 739M 27.12 1.86 2.08 βˆ’ βˆ’
AudioLDM-M-Full-FT‑ AS+AC+2 others βœ— 416M 26.12 1.26 2.57 79.85 76.84
AudioLDM-L-Full‑ AS+AC+2 others βœ— 739M 32.46 1.76 4.18 78.63 62.69
AudioLDM-L-Full-FT AS+AC+2 others βœ— 739M 23.31 1.59 1.96 βˆ’ βˆ’
TANGO AC βœ“ 866M 24.52 1.37 1.59 85.94 80.36

Tango 2

Model Parameters FAD ↓ KL ↓ IS ↑ CLAP ↑ OVL ↑ REL ↑
AudioLDM-M-Full-FT 416M 2.57 1.26 8.34 0.43 - -
AudioLDM-L-Full 739M 4.18 1.76 7.76 0.43 - -
AudioLDM 2-Full 346M 2.18 1.62 6.92 0.43 - -
AudioLDM 2-Full-Large 712M 2.11 1.54 8.29 0.44 3.56 3.19
Tango-full-FT 866M 2.51 1.15 7.87 0.54 3.81 3.77
Tango 2 866M 2.69 1.12 9.09 0.57 3.99 4.07

Citation

Please consider citing the following articles if you found our work useful:

@misc{majumder2024tango,
      title={Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization}, 
      author={Navonil Majumder and Chia-Yu Hung and Deepanway Ghosal and Wei-Ning Hsu and Rada Mihalcea and Soujanya Poria},
      year={2024},
      eprint={2404.09956},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
@article{ghosal2023tango,
  title={Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model},
  author={Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Poria, Soujanya},
  journal={arXiv preprint arXiv:2304.13731},
  year={2023}
}

Acknowledgement

We borrow the code in audioldm and audioldm_eval from the AudioLDM repositories. We thank the AudioLDM team for open-sourcing their code.

tango's People

Contributors

chenxwh, deepanwayx, dependabot[bot], hungchiayu1, nmder, saravananbcs, soujanyaporia


tango's Issues

AttributeError: 'AudioDiffusion' object has no attribute 'device'

Hi,

I am trying to run the following command from the README.md

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5 \

However, I am getting this AttributeError

Traceback (most recent call last):
  File "/Users/fielguhit/Workspace/tango/train.py", line 534, in <module>
    main()
  File "/Users/fielguhit/Workspace/tango/train.py", line 434, in main
    device = model.device
  File "/Users/fielguhit/.pyenv/versions/3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AudioDiffusion' object has no attribute 'device'

Can you please advise?

Preview and save multiple samples of the same prompt

Hi,

I see in the model's page that you can generate multiple samples for the same prompt using, for example:
prompts = [
"A car engine revving",
"A dog barks and rustles with some clicking",
"Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)

This will create two samples per prompt, so audios will comprise 2 sounds per prompt.

How do you preview the sounds and then save the one you want? I've tried using indexes, but to no success.

Regards,
esuriddick

Download Models to a Local Folder Instead of Cache Dir

At the moment Tango downloads the model data from Hugging Face into a .cache dir

/.cache/huggingface/hub/models--declare-lab--tango/snapshots/{some hashes}

It would be nice if the model was stored in the tango dir, e.g. tango/models.
This way it also becomes easier to use tango with docker volumes.

What is the proper loss value? My train and val loss is around 6.5-6.6 and does not drop.

Hi, thanks for open-sourcing this great project! I fine-tuned the Tango model on my own dataset for about 20 epochs, but the train & val loss does not drop at all, and since the loss is around 6.7, I think this means my model is generating random results.
May I ask what your loss values for train and val on AudioCaps were?
All my data are 10-second clips of 48 kHz, 2-channel audio:

Input #0, wav, from 'qslWda0kTxA_70000_80000.wav':
  Metadata:
    encoder         : Lavf59.27.100
  Duration: 00:00:10.00, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s

Here is the template command and my training command:

# Continue training the LDM from our checkpoint using the --hf_model argument
accelerate launch train.py \
--train_file="data/train_audiocaps.json" --validation_file="data/valid_audiocaps.json" --test_file="data/test_audiocaps_subset.json" \
--hf_model "declare-lab/tango" --unet_model_config="configs/diffusion_model_config.json" --freeze_text_encoder \
--gradient_accumulation_steps 8 --per_device_train_batch_size=1 --per_device_eval_batch_size=2 --augment \
--learning_rate=3e-5 --num_train_epochs 40 --snr_gamma 5 \
--text_column captions --audio_column location --checkpointing_steps="best"

# Continue training on my dataset
HF_ENDPOINT=https://hf-mirror.com accelerate launch train.py \
--hf_model "declare-lab/tango" --unet_model_config="configs/diffusion_model_config.json" --freeze_text_encoder \
--train_file="data/dataset_train.json" --validation_file="data/dataset_val.json" --test_file="data/dataset_test.json" \
--gradient_accumulation_steps 8 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--learning_rate=3e-5 --num_train_epochs 40 --snr_gamma 5 --num_train_epochs 20

Any help would be much appreciated!

Sound cloning

I'm looking to build on your research. I understand this isn't within the scope of your project; I'm just curious and wanted the thoughts of the creators. I want to retrain, repurpose, and experiment with this for expressive TTS instead of generic TTS. I'm somewhat new to working with these models.

OBJECTIVES ->

  1. retrain on a more dynamic dataset
  2. synthetic dataset -> speech/text[w/special utterances]{real speech/lofi speech from 'BARK'}, speech w/synthetic audio environments generated by 'tango'/text[I have a rather large dataset]

EXPECTATIONS ->

  1. most expressive hybrid TTS[TTS with semantic conditioned background environments]

QUESTIONS ->

  1. what are your thoughts on approaching voice cloning with this style of architecture? I figure I should approach it like inpainting?
  2. If possible, wouldn't it clone any artifacts contained in the speech audio?

CLOSING THOUGHTS ->
I'm open to sharing my results with you privately. I appreciate your contribution to the community.

The demo won't work if sentencepiece is not installed

Hello,

I created a new environment with Python 3.8, installed all the requirements.txt dependencies, and then ran the demo. After following the steps of the Jupyter notebook in #10 (comment) it didn't work. I solved it by installing sentencepiece (pip install sentencepiece).

The reason was that, without sentencepiece, I was getting this error:
OSError: [Errno 22] Invalid argument: '/home/pc/.cache/hub/models--google--flan-t5-large/blobs/W/"031ce30f2198693aabefeac9588257b81658d7f4.lock'

I just solved it; if someone else has this problem, here is the solution.

apple m1?

The error message I've got:

tango/venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Noisy audio samples

Hi,

I am trying to reproduce the results and have run the inference code. However, the generated audio samples are completely noisy. Any suggestions as to what might be going wrong here? I am sharing my inference.py script and the command I have used to run the code

CUDA_VISIBLE_DEVICES=0 python inference.py --test_file="../audiocaps/test_audiocaps_subset.json" --text_encoder_name="google/flan-t5-large" --scheduler_name="configs/stable_diffusion_2.1.json" --unet_model_config="configs/diffusion_model_config.json" --model="../audioldm-s-full.ckpt" --batch_size=6

tango_files.zip

encoder_attention_mask error

Hello there, thanks a lot for this awesome tool.
I have a problem running it; I get this error.
Note that I use the CPU method, because I get an OOM error on the GPU since I only have 8 GB of VRAM.

PS C:\Users\Genesis\anaconda3\tango> & C:/Users/Genesis/anaconda3/envs/tango/python.exe c:/Users/Genesis/anaconda3/tango/generate.py
Fetching 8 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8/8 [00:00<?, ?it/s]
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:42: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = pad_center(fft_window, filter_length)
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:151: FutureWarning: Pass sr=16000, n_fft=1024, n_mels=64, fmin=0, fmax=8000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
mel_basis = librosa_mel_fn(
UNet initialized randomly.
Some weights of the model checkpoint at google/flan-t5-large were not used when initializing T5EncoderModel: ['decoder.block.16.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.layer_norm.weight', 'decoder.block.7.layer.0.layer_norm.weight', 'decoder.block.10.layer.1.layer_norm.weight', 'decoder.block.13.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.DenseReluDense.wo.weight', 'decoder.block.6.layer.1.EncDecAttention.v.weight', 'decoder.block.16.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.0.layer_norm.weight', 'decoder.block.8.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.v.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.layer_norm.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.layer_norm.weight', 'decoder.block.1.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.0.SelfAttention.v.weight', 'decoder.block.16.layer.0.layer_norm.weight', 'decoder.block.11.layer.1.EncDecAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.17.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.1.EncDecAttention.v.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wo.weight', 'decoder.block.20.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.14.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.0.SelfAttention.o.weight', 'decoder.block.20.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.layer_norm.weight', 'decoder.block.23.layer.1.EncDecAttention.q.weight', 'decoder.block.17.layer.1.EncDecAttention.q.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.o.weight', 'decoder.block.9.layer.1.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.20.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.4.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.EncDecAttention.o.weight', 'decoder.block.15.layer.1.EncDecAttention.v.weight', 'decoder.block.6.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.1.EncDecAttention.q.weight', 
'decoder.block.21.layer.1.layer_norm.weight', 'decoder.block.4.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.k.weight', 'decoder.block.16.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.11.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.1.EncDecAttention.o.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.layer_norm.weight', 'decoder.block.17.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.q.weight', 'decoder.block.17.layer.0.layer_norm.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.o.weight', 'decoder.block.11.layer.2.layer_norm.weight', 'decoder.block.1.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.q.weight', 'decoder.block.22.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.0.layer_norm.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.0.SelfAttention.k.weight', 'decoder.block.4.layer.2.layer_norm.weight',
'decoder.block.1.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.layer_norm.weight', 'decoder.block.2.layer.1.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.10.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.2.DenseReluDense.wo.weight', 'decoder.block.3.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.19.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.q.weight', 'decoder.block.18.layer.0.SelfAttention.v.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.5.layer.2.DenseReluDense.wo.weight', 'decoder.block.8.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.2.layer_norm.weight', 'decoder.block.22.layer.2.layer_norm.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.1.EncDecAttention.k.weight', 'decoder.block.19.layer.0.layer_norm.weight', 'decoder.block.17.layer.2.layer_norm.weight', 'decoder.block.18.layer.1.layer_norm.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.20.layer.1.layer_norm.weight', 'decoder.block.3.layer.1.layer_norm.weight', 'decoder.block.5.layer.0.SelfAttention.o.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.11.layer.1.layer_norm.weight', 'decoder.block.3.layer.0.SelfAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.1.EncDecAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.v.weight', 'decoder.block.18.layer.0.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.0.SelfAttention.o.weight', 'decoder.block.21.layer.1.EncDecAttention.o.weight', 'decoder.block.6.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.2.layer_norm.weight', 'decoder.block.8.layer.1.EncDecAttention.q.weight', 'decoder.block.10.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.2.DenseReluDense.wo.weight', 'decoder.block.17.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.1.EncDecAttention.q.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.layer_norm.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.7.layer.1.EncDecAttention.o.weight', 'decoder.block.8.layer.2.DenseReluDense.wo.weight', 'decoder.embed_tokens.weight', 'decoder.block.12.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.0.SelfAttention.o.weight', 'decoder.block.22.layer.0.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.2.DenseReluDense.wo.weight', 'decoder.block.5.layer.1.EncDecAttention.o.weight', 'decoder.block.11.layer.1.EncDecAttention.o.weight', 'decoder.block.3.layer.1.EncDecAttention.o.weight', 
'decoder.block.14.layer.0.SelfAttention.k.weight', 'decoder.block.14.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.16.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.7.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.SelfAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.EncDecAttention.k.weight', 'decoder.block.20.layer.0.layer_norm.weight', 'decoder.block.10.layer.0.layer_norm.weight', 'decoder.block.6.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.3.layer.0.SelfAttention.q.weight', 'decoder.block.10.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.0.layer_norm.weight', 'decoder.block.13.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.v.weight', 'decoder.block.17.layer.2.DenseReluDense.wo.weight', 'decoder.block.18.layer.2.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wo.weight', 'decoder.block.10.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.1.EncDecAttention.q.weight', 'decoder.block.7.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.layer_norm.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.EncDecAttention.v.weight', 'decoder.block.18.layer.1.EncDecAttention.k.weight', 'decoder.block.10.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.o.weight', 'decoder.block.12.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.2.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.19.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.0.layer_norm.weight', 'decoder.block.14.layer.1.EncDecAttention.k.weight', 'decoder.block.17.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.2.layer_norm.weight', 'decoder.block.10.layer.1.EncDecAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.q.weight', 'decoder.block.5.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.v.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.18.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.1.EncDecAttention.k.weight', 'decoder.block.12.layer.1.EncDecAttention.o.weight', 'decoder.block.16.layer.1.EncDecAttention.k.weight', 'decoder.block.14.layer.0.SelfAttention.v.weight', 'decoder.block.23.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.k.weight', 
'decoder.block.16.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.1.EncDecAttention.k.weight', 'decoder.block.15.layer.0.layer_norm.weight', 'decoder.block.22.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.SelfAttention.k.weight', 'decoder.block.12.layer.2.layer_norm.weight', 'decoder.block.7.layer.1.layer_norm.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.5.layer.2.layer_norm.weight', 'decoder.block.22.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.0.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.q.weight', 'decoder.block.21.layer.0.layer_norm.weight', 'decoder.block.12.layer.1.layer_norm.weight', 'decoder.block.19.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.o.weight', 'lm_head.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.11.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.2.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.19.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_1.weight', 'decoder.final_layer_norm.weight', 'decoder.block.10.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.15.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.q.weight', 'decoder.block.14.layer.1.EncDecAttention.q.weight', 'decoder.block.23.layer.1.EncDecAttention.v.weight', 'decoder.block.22.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.0.SelfAttention.v.weight', 'decoder.block.1.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.0.SelfAttention.q.weight', 'decoder.block.11.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.o.weight', 'decoder.block.5.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.v.weight', 'decoder.block.9.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.8.layer.1.EncDecAttention.o.weight', 'decoder.block.17.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.1.layer_norm.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.21.layer.1.EncDecAttention.q.weight', 'decoder.block.3.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.q.weight', 'decoder.block.9.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.2.layer_norm.weight', 'decoder.block.2.layer.0.SelfAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.v.weight', 'decoder.block.3.layer.0.layer_norm.weight', 'decoder.block.2.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.0.SelfAttention.v.weight', 
'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.23.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.SelfAttention.q.weight',
'decoder.block.22.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.21.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.1.EncDecAttention.q.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.k.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.0.SelfAttention.k.weight', 'decoder.block.21.layer.1.EncDecAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.SelfAttention.q.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.q.weight', 'decoder.block.13.layer.1.EncDecAttention.k.weight', 'decoder.block.22.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.q.weight']

  • This IS expected if you are initializing T5EncoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing T5EncoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Successfully loaded checkpoint from: declare-lab/tango
    c:\Users\Genesis\anaconda3\tango\models.py:223: FutureWarning: Accessing config attribute in_channels directly via 'UNet2DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet2DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
    num_channels_latents = self.unet.in_channels
    Traceback (most recent call last):
    File "c:\Users\Genesis\anaconda3\tango\generate.py", line 8, in
    audio = tango.generate(prompt, steps=50)
    File "c:\Users\Genesis\anaconda3\tango\tango.py", line 46, in generate
    latents = self.model.inference([prompt], self.scheduler, steps, guidance, samples, disable_progress=disable_progress)
    File "C:\Users\Genesis\anaconda3\envs\tango\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
    File "c:\Users\Genesis\anaconda3\tango\models.py", line 234, in inference
    noise_pred = self.unet(
    File "C:\Users\Genesis\anaconda3\envs\tango\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
    TypeError: UNet2DConditionModel.forward() got an unexpected keyword argument 'encoder_attention_mask'

16 GB of GPU memory runs out

Hi.

I'm trying to train this model on a single P100 with 16 GB memory but seem to be running out of memory with a batch size of 2. Do I need more than 16 GB for this model? How can I reduce the GPU memory usage?

Cheers,

Is it possible to do fine tuning and incremental training?

I'm curious if it would be possible to fine tune the TANGO model through incremental training.

For example, I'd like to start with the audiocaps dataset, then fine tune the model with new audio datasets on a recurring basis.

Do you have any thoughts on how this could be achieved?

Inference on server without GPU

CUDA is required for this to work. Can you enable CPU inference? I know it could be slow, but I don't mind the wait for experimentation purposes.

Nan loss in training

Hi
Thanks for sharing your project.
When I trained your model with your config, however, the val and train loss were NaN.
I tried many times but the results are still the same.
Can you tell me the reasons and how to solve it?

Hardware.

Hello.

Amazing work.

What kind of hardware should I expect to need to be able to run the model?

Thank you.

Training on own model

Hi, amazing work all!
From what I understand it is possible to train the model on my own data. I've got a sound library, which I want to use to train the model. Do I just replace the .json elements in the data folder?

/Update:
I replaced the data folder jsons with mine and after some back and forth I'm stuck at the following error. Of course the columns match:

ValueError: Couldn't cast
dataset: string
location: string
to
{'dataset': Value(dtype='string', id=None), 'location': Value(dtype='string', id=None), 'captions': Value(dtype='string', id=None)}
because column names don't match

Perhaps the issue is how the file is made? I pulled the metadata into xlsx, then made a Python script to produce the json and did some formatting fixes.

thx!

PS: how big is your training setup in terms of GPUs?

Inference

Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.

Reproduce result

How can I reproduce my evaluation results? I noticed that the results of two evaluation runs for the same sample are different. Can I set a seed?

Classifier-free guidance(CFG) in training vs inference

Thanks for the awesome code.

If I am not wrong, the CFG implementation during training and inference seems to have some inconsistency.

During training, the unconditioned text-embedding is set to 0

mask_indices = [k for k in range(len(prompt)) if random.random() < 0.1]
if len(mask_indices) > 0:
    encoder_hidden_states[mask_indices] = 0

While during inference, an empty prompt is used for the text-unconditioning

uncond_tokens = [""] * len(prompt)

Are these two ways equivalent? Or am I missing something?

Downloading AudioCaps data

Hi,

I'm trying to download the AudioCaps data in order to train the Tango model. However, I'm not seeing any instructions in the AudioCaps repository on how to download it. Can you share any scripts or instructions on how to download and format the audio to train Tango?

Thanks!

Could you provide the ChatGPT prompt of sound?

The latter is obtained by prompting ChatGPT to explain the sound generated when a boat moves on the sea. Using this ChatGPT-generated description of the sound, TANGO provides superior results.

So, could you provide the ChatGPT prompt of sound?

Cannot install on Windows

git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt

Ends with

Collecting accelerate==0.18.0
  Using cached accelerate-0.18.0-py3-none-any.whl (215 kB)
Collecting bitsandbytes==0.38.1
  Using cached bitsandbytes-0.38.1-py3-none-any.whl (104.3 MB)
Collecting black==22.3.0
  Using cached black-22.3.0-cp310-cp310-win_amd64.whl (1.1 MB)
Collecting compel==1.1.3
  Using cached compel-1.1.3-py3-none-any.whl (24 kB)
Collecting d4rl==1.1
  Using cached d4rl-1.1-py3-none-any.whl (26.4 MB)
Collecting data_generator==1.0.1
  Using cached Data_Generator-1.0.1-py3-none-any.whl (11 kB)
Collecting deepdiff==6.3.0
  Using cached deepdiff-6.3.0-py3-none-any.whl (69 kB)
Collecting diffusion==6.9.1
  Using cached diffusion-6.9.1-1-py3-none-any.whl (179 kB)
Collecting einops==0.6.1
  Using cached einops-0.6.1-py3-none-any.whl (42 kB)
Collecting flash_attn==1.0.3.post0
  Using cached flash_attn-1.0.3.post0.tar.gz (2.0 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  Γ— python setup.py egg_info did not run successfully.
  β”‚ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\Jason\AppData\Local\Temp\pip-install-ftl4rk2a\flash-attn_362c7a8bea504f679c59d4a480161f3e\setup.py", line 6, in <module>
          from packaging.version import parse, Version
      ModuleNotFoundError: No module named 'packaging'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Γ— Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

don't have a correct `repo_id` and `repo_type`?

import IPython
import soundfile as sf
from tango import Tango

print("init tango begin!")
tango = Tango("declare-lab/tango")
print("init tango!")

prompt = "An audience cheering and clapping"
print("prompt input!")

audio = tango.generate(prompt)
print("audio generate!")

sf.write(f"{prompt}.wav", audio, samplerate=16000)
#IPython.display.Audio(data=audio, rate=16000)
@deepanwayx @nmder @soujanyaporia
When I run the code, the following error occurs:

[Screenshot 2023-04-30 03-49-50]

It seems that I don't have a correct repo_id and repo_type.
How can I deal with it?

License

Hi, thanks for making this great model. On your webpage, you claim that this model is "open source," however the CC-BY-NC-ND license is not open source. According to the OSI, open source software must not restrict commercial use:

  1. No Discrimination Against Fields of Endeavor
    The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

I would deeply appreciate it if you could make this model truly open source. Thank you!

About data augment

Hi!
Thanks so much for this work! When I tried to train the model on AudioCaps (didn't change the training script other than file paths), I got this issue:
File "/tango/train.py", line 553, in
main()
File "/tango/train.py", line 459, in main
mixed_mel, _, _, mixed_captions = torch_tools.augment_wav_to_fbank(audios, text, len(audios),
File "/tango/tools/torch_tools.py", line 118, in augment_wav_to_fbank
waveform, captions = augment(paths, texts)
File "/tango/tools/torch_tools.py", line 108, in augment
waveform = torch.tensor(np.concatenate(mixed_sounds, 0))
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate

It would be highly appreciated if you could kindly help me with this problem, thanks!

Is possible an inference with just a RTX 3080 of 10GB?

Hello,

I know it is very little memory, but it is what I have by now.

By default, the demo code won't run inference because CUDA runs out of memory. I tried to reduce the inference batch size to just 1, but it is not enough.

Do you know a way to reduce the memory consumption running the inference?

I know that the best solution is to upgrade the GPU to a RTX 3090/4090/A6000, but before that I would like to try another way if possible.

Thank you!

David Martin Rius

Update Requirement infos (ipython, libsndfile1)

Congratulation for your great project!

After going through your instructions, I found that I needed to install IPython manually. Please consider adding ipython to the requirement list.

Additionally, for soundfile to function properly (at least on Linux in my case), libsndfile1 is necessary. Maybe you also want to add this to the prerequisites documentation:
(sudo) apt-get install libsndfile1

about tango-full-ft-audiocaps

Hi, thanks for your great open source work!

In your work, I noticed that you used the AudioCaps dataset to fine-tune the tango-full checkpoint.
Can I know the command for your fine-tuning process?
Do I need to modify the learning rate (default=3e-5), and should I use --hf_model or --resume_from_checkpoint in the command?

Looking forward to your reply, thanks again!😊

After reconstructing an audio clip using just the VAE, I only get noise

Here is my code. Is there something wrong with my method of using the VAE?

def recon_vae(self, filename):
    """ recon audio only by vae """
    with torch.no_grad():
        waveform, sample_rate = torchaudio.load(filename)
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
        waveform = waveform - torch.mean(waveform)
        waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
        waveform = 0.5 * waveform
        waveform = waveform / torch.max(torch.abs(waveform))
        waveform = 0.5 * waveform

        #waveform = 0.5 * waveform[0:int(len(waveform)*1)]

        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        audio = torch.autograd.Variable(audio, requires_grad=False)
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        melspec = melspec.transpose(1, 2)
        melspec = melspec.unsqueeze(1)
        truth_lattent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))

        mel_recon = self.vae.decode_first_stage(truth_lattent)
        wave = self.vae.decode_to_waveform(mel_recon)
    return wave[0], waveform
