declare-lab / tango Goto Github PK
View Code? Open in Web Editor NEWA family of diffusion models for text-to-audio generation.
Home Page: https://tango2-web.github.io/
License: Other
A family of diffusion models for text-to-audio generation.
Home Page: https://tango2-web.github.io/
License: Other
Hi,
I am trying to run the following command from the README.md
accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5 \
However, I am getting this AttributeError
Traceback (most recent call last):
File "/Users/fielguhit/Workspace/tango/train.py", line 534, in <module>
main()
File "/Users/fielguhit/Workspace/tango/train.py", line 434, in main
device = model.device
File "/Users/fielguhit/.pyenv/versions/3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'AudioDiffusion' object has no attribute 'device'
Can you please advise?
Hi, thanks for open source this great project! I fine-tuned the tango model on my own dataset for about 20 epoch, but the train & val loss does not drop at all, and since the loss is around 6.7, I think this mean my model is generating random results.
May I ask what is your loss value for train and val on AudioCaps?
All my data are 10 seconds audio with 48khz 2 channel audio:
Input #0, wav, from 'qslWda0kTxA_70000_80000.wav':
Metadata:
encoder : Lavf59.27.100
Duration: 00:00:10.00, bitrate: 1536 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s
Here is the template command and my training command:
# Continue training the LDM from our checkpoint using the --hf_model argument
accelerate launch train.py \
--train_file="data/train_audiocaps.json" --validation_file="data/valid_audiocaps.json" --test_file="data/test_audiocaps_subset.json" \
--hf_model "declare-lab/tango" --unet_model_config="configs/diffusion_model_config.json" --freeze_text_encoder \
--gradient_accumulation_steps 8 --per_device_train_batch_size=1 --per_device_eval_batch_size=2 --augment \
--learning_rate=3e-5 --num_train_epochs 40 --snr_gamma 5 \
--text_column captions --audio_column location --checkpointing_steps="best"
# Continue training on my dataset
HF_ENDPOINT=https://hf-mirror.com accelerate launch train.py \
--hf_model "declare-lab/tango" --unet_model_config="configs/diffusion_model_config.json" --freeze_text_encoder \
--train_file="data/dataset_train.json" --validation_file="data/dataset_val.json" --test_file="data/dataset_test.json" \
--gradient_accumulation_steps 8 --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
--learning_rate=3e-5 --num_train_epochs 40 --snr_gamma 5 --num_train_epochs 20
Any help would be much appreciated!
Hello, could you please tell us how many VRAM GB are required for inference?
Thank you!
Here is my code. Is there something wrong on my method about using vae?
`def recon_vae(self, filename):
""" recon audio only by vae """
with torch.no_grad():
waveform, sample_rate = torchaudio.load(filename)
waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
waveform = waveform - torch.mean(waveform)
waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
waveform = 0.5 * waveform
waveform = waveform / torch.max(torch.abs(waveform))
waveform = 0.5 * waveform
#waveform = 0.5 * waveform[0:int(len(waveform)*1)]
audio = torch.unsqueeze(waveform, 0)
audio = torch.nan_to_num(torch.clip(audio, -1, 1))
audio = torch.autograd.Variable(audio, requires_grad=False)
melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
melspec = melspec.transpose(1, 2)
melspec = melspec.unsqueeze(1)
truth_lattent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))
mel_recon = self.vae.decode_first_stage(truth_lattent)
wave = self.vae.decode_to_waveform(mel_recon)
return wave[0], waveform`
Hi,
I'm trying to download the AudioCaps data in order to train the Tango model. However, I'm not seeing any instructions in the AudioCaps repository on how to download it. Can you share any scripts or instructions on how to download and format the audio to train Tango?
Thanks!
I'm curious if it would be possible to fine tune the TANGO model through incremental training.
For example, I'd like to start with the audiocaps
dataset, then fine tune the model with new audio datasets on a recurring basis.
Do you have any thoughts on how this could be achieved?
hello there ,thanks alot for this awesome tool.
i have a problem running it , i get this error
note that i use cpu method ,cause i get oom error when i use gpu cause i have only 8gb vram
PS C:\Users\Genesis\anaconda3\tango> & C:/Users/Genesis/anaconda3/envs/tango/python.exe c:/Users/Genesis/anaconda3/tango/generate.py
Fetching 8 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<?, ?it/s]
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:42: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
fft_window = pad_center(fft_window, filter_length)
c:\Users\Genesis\anaconda3\tango\audioldm\audio\stft.py:151: FutureWarning: Pass sr=16000, n_fft=1024, n_mels=64, fmin=0, fmax=8000 as keyword args. From version 0.10 passing these as positional arguments will result in an error
mel_basis = librosa_mel_fn(
UNet initialized randomly.
Some weights of the model checkpoint at google/flan-t5-large were not used when initializing T5EncoderModel: ['decoder.block.16.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.layer_norm.weight', 'decoder.block.7.layer.0.layer_norm.weight', 'decoder.block.10.layer.1.layer_norm.weight', 'decoder.block.13.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.DenseReluDense.wo.weight', 'decoder.block.6.layer.1.EncDecAttention.v.weight', 'decoder.block.16.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.0.layer_norm.weight', 'decoder.block.8.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.v.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.layer_norm.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.layer_norm.weight', 'decoder.block.1.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.3.layer.0.SelfAttention.v.weight', 'decoder.block.16.layer.0.layer_norm.weight', 'decoder.block.11.layer.1.EncDecAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.17.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.1.EncDecAttention.v.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wo.weight', 'decoder.block.20.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.v.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.14.layer.1.EncDecAttention.o.weight', 'decoder.block.7.layer.0.SelfAttention.o.weight', 'decoder.block.20.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.2.layer_norm.weight', 'decoder.block.23.layer.1.EncDecAttention.q.weight', 'decoder.block.17.layer.1.EncDecAttention.q.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.10.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.o.weight', 'decoder.block.9.layer.1.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.20.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.4.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.1.EncDecAttention.o.weight', 'decoder.block.15.layer.1.EncDecAttention.v.weight', 'decoder.block.6.layer.2.layer_norm.weight', 'decoder.block.12.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.1.EncDecAttention.q.weight', 'decoder.block.21.layer.1.layer_norm.weight', 'decoder.block.4.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.k.weight', 'decoder.block.16.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.11.layer.2.DenseReluDense.wo.weight', 'decoder.block.13.layer.1.EncDecAttention.o.weight', 'decoder.block.20.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.layer_norm.weight', 'decoder.block.17.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.q.weight', 'decoder.block.17.layer.0.layer_norm.weight', 'decoder.block.11.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.k.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.0.SelfAttention.k.weight', 'decoder.block.22.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.SelfAttention.o.weight', 'decoder.block.11.layer.2.layer_norm.weight', 'decoder.block.1.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.q.weight', 'decoder.block.22.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.0.SelfAttention.k.weight', 'decoder.block.18.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.0.layer_norm.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.2.layer.0.SelfAttention.k.weight', 'decoder.block.4.layer.2.layer_norm.weight',
'decoder.block.1.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.13.layer.0.layer_norm.weight', 'decoder.block.2.layer.1.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.10.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.2.DenseReluDense.wo.weight', 'decoder.block.3.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.19.layer.0.SelfAttention.k.weight', 'decoder.block.8.layer.0.SelfAttention.q.weight', 'decoder.block.18.layer.0.SelfAttention.v.weight', 'decoder.block.6.layer.1.EncDecAttention.o.weight', 'decoder.block.5.layer.2.DenseReluDense.wo.weight', 'decoder.block.8.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.o.weight', 'decoder.block.1.layer.2.layer_norm.weight', 'decoder.block.22.layer.2.layer_norm.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.1.EncDecAttention.k.weight', 'decoder.block.19.layer.0.layer_norm.weight', 'decoder.block.17.layer.2.layer_norm.weight', 'decoder.block.18.layer.1.layer_norm.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.20.layer.1.layer_norm.weight', 'decoder.block.3.layer.1.layer_norm.weight', 'decoder.block.5.layer.0.SelfAttention.o.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.16.layer.1.layer_norm.weight', 'decoder.block.11.layer.1.layer_norm.weight', 'decoder.block.3.layer.0.SelfAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.k.weight', 'decoder.block.20.layer.1.EncDecAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.v.weight', 'decoder.block.18.layer.0.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.0.SelfAttention.o.weight', 'decoder.block.21.layer.1.EncDecAttention.o.weight', 'decoder.block.6.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.2.layer_norm.weight', 'decoder.block.8.layer.1.EncDecAttention.q.weight', 'decoder.block.10.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.2.DenseReluDense.wo.weight', 'decoder.block.17.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.1.EncDecAttention.q.weight', 'decoder.block.13.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.2.layer_norm.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.7.layer.1.EncDecAttention.o.weight', 'decoder.block.8.layer.2.DenseReluDense.wo.weight', 'decoder.embed_tokens.weight', 'decoder.block.12.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.0.SelfAttention.o.weight', 'decoder.block.22.layer.0.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.2.DenseReluDense.wo.weight', 'decoder.block.5.layer.1.EncDecAttention.o.weight', 'decoder.block.11.layer.1.EncDecAttention.o.weight', 'decoder.block.3.layer.1.EncDecAttention.o.weight', 'decoder.block.14.layer.0.SelfAttention.k.weight', 'decoder.block.14.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.16.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.2.DenseReluDense.wo.weight', 'decoder.block.7.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.SelfAttention.o.weight', 'decoder.block.19.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.1.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.4.layer.1.EncDecAttention.k.weight', 'decoder.block.20.layer.0.layer_norm.weight', 'decoder.block.10.layer.0.layer_norm.weight', 'decoder.block.6.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.3.layer.0.SelfAttention.q.weight', 'decoder.block.10.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.0.layer_norm.weight', 'decoder.block.13.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.EncDecAttention.k.weight', 'decoder.block.8.layer.1.EncDecAttention.v.weight', 'decoder.block.17.layer.2.DenseReluDense.wo.weight', 'decoder.block.18.layer.2.layer_norm.weight', 'decoder.block.6.layer.2.DenseReluDense.wo.weight', 'decoder.block.10.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.1.EncDecAttention.q.weight', 'decoder.block.7.layer.2.layer_norm.weight', 'decoder.block.7.layer.0.SelfAttention.q.weight', 'decoder.block.4.layer.0.layer_norm.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.EncDecAttention.v.weight', 'decoder.block.18.layer.1.EncDecAttention.k.weight', 'decoder.block.10.layer.0.SelfAttention.v.weight', 'decoder.block.20.layer.0.SelfAttention.o.weight', 'decoder.block.15.layer.0.SelfAttention.o.weight', 'decoder.block.12.layer.0.SelfAttention.q.weight', 'decoder.block.1.layer.0.SelfAttention.o.weight', 'decoder.block.2.layer.0.SelfAttention.v.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.19.layer.1.EncDecAttention.o.weight', 'decoder.block.23.layer.1.EncDecAttention.k.weight', 'decoder.block.5.layer.0.SelfAttention.q.weight', 'decoder.block.12.layer.0.layer_norm.weight', 'decoder.block.14.layer.1.EncDecAttention.k.weight', 'decoder.block.17.layer.0.SelfAttention.q.weight', 'decoder.block.21.layer.2.layer_norm.weight', 'decoder.block.10.layer.1.EncDecAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.q.weight', 'decoder.block.5.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.k.weight', 'decoder.block.1.layer.0.SelfAttention.v.weight', 'decoder.block.12.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.18.layer.1.EncDecAttention.q.weight', 'decoder.block.12.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.7.layer.1.EncDecAttention.k.weight', 'decoder.block.12.layer.1.EncDecAttention.o.weight', 'decoder.block.16.layer.1.EncDecAttention.k.weight', 'decoder.block.14.layer.0.SelfAttention.v.weight', 'decoder.block.23.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.k.weight', 'decoder.block.16.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.1.EncDecAttention.k.weight', 'decoder.block.15.layer.0.layer_norm.weight', 'decoder.block.22.layer.1.EncDecAttention.v.weight', 'decoder.block.23.layer.0.SelfAttention.k.weight', 'decoder.block.12.layer.2.layer_norm.weight', 'decoder.block.7.layer.1.layer_norm.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.5.layer.2.layer_norm.weight', 'decoder.block.22.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.0.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.k.weight', 'decoder.block.13.layer.1.layer_norm.weight', 'decoder.block.23.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.15.layer.1.EncDecAttention.q.weight', 'decoder.block.21.layer.0.layer_norm.weight', 'decoder.block.12.layer.1.layer_norm.weight', 'decoder.block.19.layer.2.DenseReluDense.wo.weight', 'decoder.block.22.layer.0.SelfAttention.o.weight', 'lm_head.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.11.layer.1.EncDecAttention.q.weight', 'decoder.block.16.layer.2.layer_norm.weight', 'decoder.block.14.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.19.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.2.DenseReluDense.wi_1.weight', 'decoder.final_layer_norm.weight', 'decoder.block.10.layer.0.SelfAttention.o.weight', 'decoder.block.17.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.15.layer.2.layer_norm.weight', 'decoder.block.19.layer.1.layer_norm.weight', 'decoder.block.20.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.5.layer.1.EncDecAttention.q.weight', 'decoder.block.14.layer.1.EncDecAttention.q.weight', 'decoder.block.23.layer.1.EncDecAttention.v.weight', 'decoder.block.22.layer.2.DenseReluDense.wo.weight', 'decoder.block.12.layer.0.SelfAttention.v.weight', 'decoder.block.1.layer.1.EncDecAttention.o.weight', 'decoder.block.21.layer.0.SelfAttention.q.weight', 'decoder.block.11.layer.0.SelfAttention.o.weight', 'decoder.block.18.layer.0.SelfAttention.o.weight', 'decoder.block.5.layer.0.layer_norm.weight', 'decoder.block.21.layer.0.SelfAttention.v.weight', 'decoder.block.9.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.9.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.18.layer.0.SelfAttention.q.weight', 'decoder.block.8.layer.1.EncDecAttention.o.weight', 'decoder.block.17.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.2.DenseReluDense.wo.weight', 'decoder.block.15.layer.1.layer_norm.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.21.layer.1.EncDecAttention.q.weight', 'decoder.block.3.layer.2.layer_norm.weight', 'decoder.block.16.layer.1.EncDecAttention.q.weight', 'decoder.block.9.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.14.layer.1.layer_norm.weight', 'decoder.block.1.layer.1.EncDecAttention.v.weight', 'decoder.block.9.layer.2.layer_norm.weight', 'decoder.block.2.layer.0.SelfAttention.q.weight', 'decoder.block.19.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.1.EncDecAttention.v.weight', 'decoder.block.3.layer.0.layer_norm.weight', 'decoder.block.2.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.23.layer.0.SelfAttention.v.weight', 'decoder.block.14.layer.0.SelfAttention.q.weight',
'decoder.block.22.layer.1.EncDecAttention.q.weight', 'decoder.block.15.layer.0.SelfAttention.v.weight', 'decoder.block.3.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.8.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.21.layer.1.EncDecAttention.v.weight', 'decoder.block.10.layer.2.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.o.weight', 'decoder.block.2.layer.1.EncDecAttention.k.weight', 'decoder.block.2.layer.1.EncDecAttention.v.weight', 'decoder.block.2.layer.1.EncDecAttention.q.weight', 'decoder.block.5.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.1.layer_norm.weight', 'decoder.block.9.layer.1.EncDecAttention.k.weight', 'decoder.block.21.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.23.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.2.DenseReluDense.wi_1.weight', 'decoder.block.17.layer.0.SelfAttention.k.weight', 'decoder.block.21.layer.1.EncDecAttention.k.weight', 'decoder.block.9.layer.0.SelfAttention.v.weight', 'decoder.block.11.layer.1.EncDecAttention.v.weight', 'decoder.block.5.layer.1.layer_norm.weight', 'decoder.block.9.layer.0.SelfAttention.q.weight', 'decoder.block.23.layer.2.DenseReluDense.wi_0.weight', 'decoder.block.6.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.SelfAttention.q.weight', 'decoder.block.13.layer.1.EncDecAttention.k.weight', 'decoder.block.22.layer.0.SelfAttention.k.weight', 'decoder.block.6.layer.1.EncDecAttention.q.weight']
in_channels
directly via 'UNet2DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet2DConditionModel's config object instead, e.g. 'unet.config.in_channels'.How can I reproduce my evaluation result ? I realized that the results of two evaluation for a sample are different. Can I set seed ?
Hello,
I created a new environment with python3.8, I installed all the requirements.txt dependencies and then I ran the demo. After following the steps of the jupyter notebook of #10 (comment) it didn't work. I solved it installing pip install sentencepiece
It was because I was getting the error. It could not be installed without sentencepiece.
OSError: [Errno 22] Invalid argument: '/home/pc/.cache/hub/models--google--flan-t5-large/blobs/W/"031ce30f2198693aabefeac9588257b81658d7f4.lock'
I just solved it, if someone else has this problem here is the solution.
Hey all!
Congrats on the amazing project and for making it open-sourced.
I want to fine-tune tango2, the results are really impressive.
I have a proprietary dataset of audio and text captions, but nothing in the Audio-Alpaca format...
How could one fine-tune tango 2 on those conditions?
thanks a lot
Hi, thanks for making this great model. On your webpage, you claim that this model is "open source," however the CC-BY-NC-ND license is not open source. According to the OSI, open source software must not restrict commercial use:
- No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
I would deeply appreciate it if you could make this model truly open source. Thank you!
Congratulation for your great project!
After going through your instructions, I found that I needed to install IPython manually. Please consider adding ipython to the requirement list.
Additionally, for soundfile to function properly (at least on Linux in my case), libsndfile1 is necessary. Maybe you also want to add this to the prerequisites documentation:
(sudo) apt-get install libsndfile1
Hi, amazing work all!
From what I understand it is possible to train the model on my own data. I've got a sound library, which I want to use to train the model. Do I just replace the .json elements in the data folder?
/Update:
I replaced the data folder json's with mine and after some back and forth I'm stick at the following error. Of course the columns match:
ValueError: Couldn't cast
dataset: string
location: string
to
{'dataset': Value(dtype='string', id=None), 'location': Value(dtype='string', id=None), 'captions': Value(dtype='string', id=None)}
because column names don't match
Perhapse the issue is how the file is made? I pulled he metadata into xlsx -> made python script to make json and did some formatting fixes.
thx!
Ps. how big is your training setup in terms of GPU's?
Getting this error:
ImportError: cannot import name 'Tango' from 'tango' (unknown location)
could you release audio_alphaca_15k.json file for training?
Hey!
I was wondering if it was possible to train the model in 48kHz audio, and then generate audio directly in 48kHz. Has anyone attempted this?
Hi! Thank u for making this amazing project public!
I just want a guidance for a problem I meet when tuning this code. The confusion is that why the vae decoder cannot recover the original speech wav when I directly use the latent code extracted by the provided encoder as the input?
I noticed that you updated the inference_hf.py file in your tango repository. May I kindly ask how it differs from the inference.py file?
Error message ive got:tango/venv/lib/python3.10/site-packages/torch/cuda/init.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.
Hello,
I know it is very little memory, but it is what I have by now.
By default, the demo code won't inference because of cuda out of memory. I tried to reduce the batch size of the inference to just 1, but is not enough.
Do you know a way to reduce the memory consumption running the inference?
I know that the best solution is to upgrade the GPU to a RTX 3090/4090/A6000, but before that I would like to try another way if possible.
Thank you!
David Martin Rius
Hello.
Amazing work.
What kind of hardware should I expect to need to be able to run the model?
Thank you.
GEtting a 404 for this - https://huggingface.co/spaces/declare-lab/tango-text-to-audio-generation
Was it taken down?
Hi, thanks for your great open source work!
In your work, I noticed that you used audiocaps dataset to fine tune on tango-full ckpt,
can I know the command for your fine tuning process?
Do I need to modify the learning rate (default=3e-5) and do I use --hf_model or --resume_from_checkpoint in the command?
Looking forward to your reply, thanks again!😊
Hi
Thanks for sharing your project.
When I trained your model based on your config, however, the val and train loss was NAN.
I tried many times but the results are still the same.
Can you tell me the reasons and how to solve it?
Hi,
I am trying to reproduce the results and have run the inference code. However, the generated audio samples are completely noisy. Any suggestions as to what might be going wrong here? I am sharing my inference.py script and the command I have used to run the code
CUDA_VISIBLE_DEVICES=0 python inference.py --test_file="../audiocaps/test_audiocaps_subset.json" --text_encoder_name="google/flan-t5-large" --scheduler_name="configs/stable_diffusion_2.1.json" --unet_model_config="configs/diffusion_model_config.json" --model="../audioldm-s-full.ckpt" --batch_size=6
The latter is obtained by prompting ChatGPT to explain the sound generated when a boat moves on the sea. Using this ChatGPT-generated description of the sound, TANGO provides superior results.
So, could you provide the ChatGPT prompt of sound?
At this moment Tango is downloading the model data from huggingface via huggingface to a .cache dir
/.cache/huggingface/hub/models--declare-lab--tango/snapshots/{some hashes}
It would be nice if the model was stored in the tango dir, e.g. tango/models
.
This way it also becomes easier to use tango with docker volumes.
Hi!
Thanks so much for this work! When I tried to train the model on AudioCaps (didn't change the training script other than file paths), I got this issue:
File "/tango/train.py", line 553, in
main()
File "/tango/train.py", line 459, in main
mixed_mel, _, _, mixed_captions = torch_tools.augment_wav_to_fbank(audios, text, len(audios),
File "/tango/tools/torch_tools.py", line 118, in augment_wav_to_fbank
waveform, captions = augment(paths, texts)
File "/tango/tools/torch_tools.py", line 108, in augment
waveform = torch.tensor(np.concatenate(mixed_sounds, 0))
File "<array_function internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
It would be highly appreciated if you could kindly help me with this problem, thanks!
Thanks for the awesome code.
If I am not wrong, the CFG implementation during training and inference seems to have some inconsitency.
During training, the unconditioned text-embedding is set to 0
mask_indices = [k for k in range(len(prompt)) if random.random() < 0.1]
if len(mask_indices) > 0:
encoder_hidden_states[mask_indices] = 0
While during inference, an empty prompt is used for the text-unconditioning
uncond_tokens = [""] * len(prompt)
Are these two ways equivalent? Or am I missing something?
Hi.
I'm trying to train this model on a single P100 with 16 GB memory but seem to be running out of memory with a batch size of 2. Do I need more than 16 GB for this model? How can I reduce the GPU memory usage?
Cheers,
`import IPython
import soundfile as sf
from tango import Tango
print("init tango begin!")
tango = Tango("declare-lab/tango")
print("init tango!")
prompt = "An audience cheering and clapping"
print("prompt input!")
audio = tango.generate(prompt)
print("audio generate!")
sf.write(f"{prompt}.wav", audio, samplerate=16000)
#IPython.display.Audio(davoid imageThread::start_image(Mat frame)ta=audio, rate=16000)`
@deepanwayx @nmder @soujanyaporia
when i run the code ,following error occurs:
it seems that i don't have a correct repo_id
and repo_type
how can i deal with it ?
Hi,
I see in the model's page that you can generate multiple samples for the same prompt using, for example:
prompts = [
"A car engine revving",
"A dog barks and rustles with some clicking",
"Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
This will create two samples per prompt, so audios will comprise 2 sounds per prompt.
How do you preview the sounds and then save the one you want? I've tried using indexes, but to no success.
Regards,
esuriddick
git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt
Ends with
Collecting accelerate==0.18.0
Using cached accelerate-0.18.0-py3-none-any.whl (215 kB)
Collecting bitsandbytes==0.38.1
Using cached bitsandbytes-0.38.1-py3-none-any.whl (104.3 MB)
Collecting black==22.3.0
Using cached black-22.3.0-cp310-cp310-win_amd64.whl (1.1 MB)
Collecting compel==1.1.3
Using cached compel-1.1.3-py3-none-any.whl (24 kB)
Collecting d4rl==1.1
Using cached d4rl-1.1-py3-none-any.whl (26.4 MB)
Collecting data_generator==1.0.1
Using cached Data_Generator-1.0.1-py3-none-any.whl (11 kB)
Collecting deepdiff==6.3.0
Using cached deepdiff-6.3.0-py3-none-any.whl (69 kB)
Collecting diffusion==6.9.1
Using cached diffusion-6.9.1-1-py3-none-any.whl (179 kB)
Collecting einops==0.6.1
Using cached einops-0.6.1-py3-none-any.whl (42 kB)
Collecting flash_attn==1.0.3.post0
Using cached flash_attn-1.0.3.post0.tar.gz (2.0 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "C:\Users\Jason\AppData\Local\Temp\pip-install-ftl4rk2a\flash-attn_362c7a8bea504f679c59d4a480161f3e\setup.py", line 6, in <module>
from packaging.version import parse, Version
ModuleNotFoundError: No module named 'packaging'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
I'm looking to build on your research. I understand this isn't the scope of your project. Just curious for an interesting. Just wanted the thoughts of the creators. I want to retrain and repuprose and expertiment with this for expressive TTS instead of generic . I'm somewhat new to working with these models.
OBJECTIVES ->
EXPECTATATIOS ->
QUESTIONS ->
CLOSING THOUGHTS ->
I'm opening to sharing my results with you guys privately. Appreciate your contribution to the community.
Quality is first!
No text input. How can i use this repo to train?
Hi, This is really a lovely repository. But how can I change the duration of the generated audio?
Thanks!!
Cuda is required for this to work. Can you enable cpu inference on server. I know it could be slow, but I don't mind the wait for experimentation purpose.
The homepage shows only 10secs for all the audios, i want to know if the audio length is controllable ? Or there is a minimum as in audioldm, you can't generate less then 2.5 secs
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.