0417keito / jen-1-pytorch

Unofficial implementation of JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models (https://arxiv.org/abs/2308.04729)

Python 100.00%

jen-1-pytorch's Introduction

JEN-1-pytorch


💻 Installation

git clone https://github.com/0417keito/JEN-1-pytorch.git
cd JEN-1-pytorch
pip install -r requirements.txt

🐍Usage

Sampling

import torch
from generation import Jen1

# Path to a trained checkpoint file
ckpt_path = 'your ckpt path'
jen1 = Jen1(ckpt_path)

# Text prompt describing the music to generate
prompt = 'a beautiful song'
samples = jen1.generate(prompt)

Training

torchrun train.py

Dataset format

Metadata is stored as JSON. Each JSON file must have the same base name as its target music file (for example, music1.json for music1.wav).

{"prompt": "a beautiful song"}
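A minimal sketch of writing such a metadata file, assuming the only required key is "prompt" as shown above. The helper name `write_prompt_metadata` and its arguments are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def write_prompt_metadata(audio_path, prompt, metadata_dir):
    """Write {"prompt": ...} to <metadata_dir>/<audio base name>.json.

    Hypothetical helper, not part of JEN-1-pytorch. Returns the JSON path.
    """
    metadata_dir = Path(metadata_dir)
    metadata_dir.mkdir(parents=True, exist_ok=True)
    # Reuse the audio file's base name so the JSON and the audio file match.
    json_path = metadata_dir / (Path(audio_path).stem + ".json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"prompt": prompt}, f, ensure_ascii=False)
    return json_path
```

For example, `write_prompt_metadata("dataset_dir/audios/music1.wav", "a beautiful song", "dataset_dir/metadata")` would create `dataset_dir/metadata/music1.json`.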

How should the data_dir be created?

'''
dataset_dir
├── audios
|    ├── music1.wav
|    ├── music2.wav
|    .......
|    ├── music{n}.wav
|
├── metadata
|   ├── music1.json
|   ├── music2.json
|   ......
|   ├── music{n}.json
|
'''
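Before training, it may help to confirm that every audio file has a matching JSON file. The sketch below assumes the `audios` / `metadata` layout shown above; the function name `find_unpaired_files` and the `audio_exts` parameter are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_unpaired_files(dataset_dir, audio_exts=(".wav",)):
    """Return (audio base names without JSON, JSON base names without audio).

    Hypothetical helper; assumes dataset_dir contains 'audios' and 'metadata'.
    """
    root = Path(dataset_dir)
    audio_stems = {
        p.stem for p in (root / "audios").iterdir()
        if p.suffix.lower() in audio_exts
    }
    json_stems = {p.stem for p in (root / "metadata").glob("*.json")}
    # Set differences give the names present on one side only.
    return sorted(audio_stems - json_stems), sorted(json_stems - audio_stems)
```

Both returned lists are empty when the directory is consistent.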

About config

Please see config.py and conditioner_config.py.

🧠TODO

Extension to JEN-1-Composer
Extension to music generation with singing voice
Adaptation of Consistency Model
The paper uses a Diffusion Autoencoder, but because I had limited computing resources I used Encodec instead. If resources allow, I will implement the Diffusion Autoencoder.

🚀Demo

coming soon !

🙏Appreciation

Dr Adam Fils - Thank you for providing the GPU. I really appreciate Adam giving me this opportunity.

⭐️Show Your Support

If you find this repo interesting and useful, give us a ⭐️ on GitHub! It encourages us to keep improving the model and adding exciting features. Please report any problems by opening an issue.

🙆Welcome Contributions

Contributions are always welcome.

jen-1-pytorch's People

adamfils chenchy zoahmed-xyz ronalmoo mvsakrishna syabahmad huyquoctrinh shansongliu

jen-1-pytorch's Issues

Problem of the input_concat_cond

In the implementation of music inpainting and continuation tasks, I've noticed that the code concatenates the masked audio with the input.
However, when the masked audio is processed, only the first element of the batch is taken and then duplicated. I'm curious about the reason for this. The comments indicate that even the author is unsure about the rationale behind this approach, so I would like to know the reference or source that inspired this piece of code. Thank you!

JEN-1-pytorch/generation.py

Line 171 in 97a8e7d

if len(self.input_concat_ids) > 0:

Prepare dataset - wav/mp3 should be a full-length audio or chunked one

I want to prepare data.

'''
How should the data_dir be created?

dataset_dir
├── audios
| ├── music1.wav
| ├── music2.wav
| .......
| ├── music{n}.wav
|
├── metadata
| ├── music1.json
| ├── music2.json
| ......
| ├── music{n}.json
|
'''

What should the length of music1.wav, music2.wav, etc., be? Should it be a full song, perhaps five minutes long, which will then be automatically trimmed for us? Or do I need to segment it (e.g., into 10-second segments) and place it in the audios folder?
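This thread does not record an answer. If pre-segmenting does turn out to be necessary, the following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV input; the function name `split_wav`, the output naming scheme, and the 10-second default are hypothetical, and each segment would still need its own JSON file with a matching base name.

```python
import wave
from pathlib import Path


def split_wav(src_path, out_dir, segment_seconds=10.0):
    """Split a PCM WAV file into fixed-length segments.

    Hypothetical helper, not part of JEN-1-pytorch.
    Returns the list of written paths; the last segment may be shorter.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(src_path).stem
    written = []
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        frames_per_segment = int(segment_seconds * params.framerate)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break  # end of file reached
            out_path = out_dir / f"{stem}_{index:04d}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Copy channels, sample width and rate; the frame count
                # in the header is corrected when the data is written.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
            index += 1
    return written
```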

Thank you!

The loss converges, but when I try generation.py it only generates noisy audio

I followed the tutorial and trained my own model using approximately 300 hours of song accompaniment data. It converged well, but when I tried to generate a song from the best model, even using the same prompt input from the training set, it only generated noisy audio.

I checked the code and noticed that only the UNet1D is saved and loaded during inference, and the Diffusion model is not. Is there anyone who has successfully trained and can actually infer from their model who could offer me any tips? Thank you!

Great Progress!

Hey! I was going through the codebase and saw you've made amazing progress on replicating JEN-1. If you need support on the GPU side or datasets, please let me know and I'd be happy to provide access to some spare A100s, along with 200k copyright-free music files + descriptions.

I need help with the ckpt_path

I am trying to run JEN-1 on my own in a Colab notebook, but I have an issue with the ckpt_path, because I can't find it. I hope someone can help. This is my Colab: https://colab.research.google.com/drive/15YFORXT4YZyHv2oNdItaDpLh8M1fTbK7?usp=sharing

RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

When running 'torchrun train.py' I get this error:
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

Traceback:
Traceback (most recent call last):
File "E:\New folder\JEN-1-pytorch\train.py", line 129, in <module>
main(config=Config)
File "E:\New folder\JEN-1-pytorch\train.py", line 20, in main
run(rank=0, n_gpus=1, config=config)
File "E:\New folder\JEN-1-pytorch\train.py", line 124, in run
trainer.train_loop()
File "E:\New folder\JEN-1-pytorch\trainer.py", line 116, in train_loop
for batch_idx, (audio_emb, metadata) in enumerate(data_iter):
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\utils\data\dataloader.py", line 631, in __next__
data = self._next_data()
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\utils\data\dataloader.py", line 675, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\utils\data\dataset.py", line 399, in __getitems__
return [self.dataset[self.indices[idx]] for idx in indices]
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\utils\data\dataset.py", line 399, in <listcomp>
return [self.dataset[self.indices[idx]] for idx in indices]
File "E:\New folder\JEN-1-pytorch\dataset\dataloader.py", line 95, in __getitem__
chunk = convert_audio(chunk, sr, model.sample_rate, model.channels)
File "C:\Users\jeroe\AppData\Roaming\Python\Python310\site-packages\encodec\utils.py", line 88, in convert_audio
wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torchaudio\transforms\_transforms.py", line 979, in forward
return _apply_sinc_resample_kernel(waveform, self.orig_freq, self.new_freq, self.gcd, self.kernel, self.width)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torchaudio\functional\functional.py", line 1462, in _apply_sinc_resample_kernel
waveform = waveform.view(-1, shape[-1])
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous
[2024-02-25 14:34:58,648] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 27492) of binary: D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\python.exe
Traceback (most recent call last):
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\distributed\run.py", line 812, in main
run(args)
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\distributed\run.py", line 803, in run
elastic_launch(
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\Users\jeroe\pinokio\bin\miniconda\envs\videodiff\lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-02-25_14:34:58
host : J-Café
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 27492)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
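The failing call in the traceback is the resampler reshaping a tensor with 0 elements, reached from `convert_audio(chunk, ...)` in dataset\dataloader.py. That is consistent with the dataset producing an empty chunk. One possible cause (an assumption, not confirmed in this thread) is an audio file that contains no samples or is shorter than the chunk the loader tries to read. The sketch below scans a folder for such files using only the standard `wave` module; the function name `find_short_wavs` and the `min_seconds` parameter are hypothetical.

```python
import wave
from pathlib import Path


def find_short_wavs(audio_dir, min_seconds=0.0):
    """Return (path, duration_seconds) for WAV files whose duration is
    at most min_seconds. Files the wave module cannot read are reported
    with a duration of None.

    Hypothetical helper, not part of JEN-1-pytorch.
    """
    flagged = []
    for path in sorted(Path(audio_dir).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as w:
                rate = w.getframerate()
                # Guard against a zero sample rate in a damaged header.
                duration = w.getnframes() / rate if rate else 0.0
        except (wave.Error, EOFError):
            flagged.append((path, None))
            continue
        if duration <= min_seconds:
            flagged.append((path, duration))
    return flagged
```

Note that the standard `wave` module only reads PCM WAV, so a file reported with `None` may simply use another encoding (for example floating-point samples) rather than being damaged.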

Any update about checkpoints?

About SpeechTokenizer_trainer

Hi, bro
I'm sorry to open this issue here; issues are not enabled in the speechtokenizer_trainer repo.
Thanks a lot for sharing such useful and meaningful work. I've carefully read and tried your training code for weeks, but there are still some issues bothering me.
In my experiments, the code works well when '--do_distillation' is turned off. But when distillation is enabled, the gradient of speechtokenizer becomes NaN after the first backward step. I wonder whether you have successfully reproduced speechtokenizer, or whether there is still a problem in loss_distillation. Thanks again for sharing.