
bigvgan's Introduction

Sang-Hoon Lee

I will join the Department of Software and Computer Engineering at Ajou University as an Assistant Professor in March 2024 (SAIL, Speech AI Lab). I am currently a postdoctoral researcher at the AI Research Center, Korea University, Seoul, South Korea. I received my Ph.D. from the Department of Brain and Cognitive Engineering, Korea University, in 2023. In March 2016, I began the integrated M.S. & Ph.D. program in the Pattern Recognition & Machine Learning (PRML) Lab at Korea University, Seoul, Korea, under the supervision of Seong-Whan Lee.

👀 Research Interests

  • Speech Synthesis (2019~, HierSpeech++, DDDM-VC)
  • Neural Vocoder (2021~, PeriodWave, PeriodWave-Turbo, Fre-GAN, Fre-GAN2)
  • Audio Generation (2023~, DDDM-Mixer)
  • Singing Voice Synthesis (2022~, MIDI-Voice, HiddenSinger)
  • Speech-to-Speech Translation (2023~, TranSentence)
  • Brain-Computer Interface (2019~2020, Brain-to-Speech System)
  • Reinforcement Learning (2017~2018, AI Curling Robot Curly)

✔ News

  • 2024.04: One paper has been accepted to TASLP (DiffProsody)
  • 2024.01: One paper has been accepted to ICASSP 2024 (LIMMITS'24, ICASSP SP Grand Challenges)
  • 2023.12: One paper has been accepted to TASLP (Fre-Painter)
  • 2023.12: Two papers have been accepted to ICASSP 2024 (TranSentence, MIDI-Voice)
  • 2023.12: One paper has been accepted to AAAI 2024 (DDDM-VC)
  • 2023.11: We release HierSpeech++, a zero-shot speech synthesis model for zero-shot TTS, zero-shot VC, and speech super-resolution. [Demo] [Code] [Gradio]
  • 2023.06: We release HiddenSinger, a high-quality singing voice synthesis system. This project was funded by Netmarble AI Center, Netmarble Corp. in 2022.

🎉 Publications

arXiv

2024

2023

2022

2021

~2020

Patents (KR)

  • "METHOD AND SYSTEM FOR SYNTHESIZING SPEECH," 10-2663162, 29, Apr., 2024.
  • "METHOD TO TRANSFORM VOICE," 10-2439022, 29, Aug., 2022.
  • "METHOD AND APPARTUS FOR VOICE CONVERSION BY USING NEURAL NETWORK," 10-2340486, 14, Dec., 2021.
  • "SYSTEM AND METHOD FOR CURLING SWEEPING CONTROL," 10-2257358, 21, May, 2021.
  • "APPARATUS AND METHOD FOR RECOMMENDATION OF CURLING GAME STRATEGY USING DEEP LEARNING," 10-2045567, 11, Nov., 2019.
  • "APPARATUS AND METHOD FOR DELIVERY AND SWEEPING AT CURLING GAME," 10-1948713, 11, Feb., 2019.

✨ Education

2016.03-2023.02: Integrated M.S. & Ph.D., Dept. of Brain and Cognitive Engineering, Korea University

2012.03-2016.02: B.S., Dept. of Life Science, Dongguk University

🎁 Awards and Services

Reviewer: NeurIPS, ICLR, ICML, AAAI, ICASSP, IEEE/ACM Transactions on Audio, Speech, and Language Processing

2022.02.25: Paper Award (Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis), Korea University

🎙 Invited Talks

2024.06.25: Fake Audio Detection, Ajou University.

2024.06.07: Speech Synthesis, 2nd AI Convergence Workshop, Ajou University.

2024.05.24: Speech Language Model for Generative AI, KSCS2024

2023.08.18: Towards Unified Speech Synthesis for Text-to-Speech and Voice Conversion, Deepbrain AI

2023.08.11: Towards Unified Speech Synthesis for Text-to-Speech and Voice Conversion, Workshop on Brain and Artificial Intelligence 2023

2023.06.20: HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis, Top Conference Session in KCC2023

2022.08.19: VoiceMixer: Adversarial Voice Style Mixup, AIGS Symposium 2022

2022.07.01: VoiceMixer: Adversarial Voice Style Mixup, Top Conference Session in KCC2022

2021.12.02: Voice Conversion, Netmarble

2021.07.29: Speech Synthesis and Voice Conversion, Neosapience

bigvgan's People

Contributors: sb-kim-prml, sh-lee-prml


bigvgan's Issues

Single-node multi-GPU training extremely slow vs. single-node 1-GPU training

Please observe the timestamps for 1-GPU training:

2022-06-29 11:54:05,369	bigvgan_22k	INFO	Train Epoch: 55 [65%]
2022-06-29 11:54:05,369	bigvgan_22k	INFO	[3.5512986183166504, 2.5630311965942383, 5.836206436157227, 12.408121109008789, 120400, 0.00019865446220135974]
2022-06-29 11:56:36,490	bigvgan_22k	INFO	Train Epoch: 55 [74%]
2022-06-29 11:56:36,491	bigvgan_22k	INFO	[3.441840887069702, 2.838928461074829, 7.337043285369873, 13.521254539489746, 120600, 0.00019865446220135974]
2022-06-29 11:59:07,465	bigvgan_22k	INFO	Train Epoch: 55 [83%]
2022-06-29 11:59:07,465	bigvgan_22k	INFO	[3.481898784637451, 2.849485397338867, 5.888651371002197, 11.99553108215332, 120800, 0.00019865446220135974]
2022-06-29 12:01:38,538	bigvgan_22k	INFO	Train Epoch: 55 [93%]

2-GPU training (CUDA_VISIBLE_DEVICES="0,1"):

2022-07-01 20:25:10,022	bigvgan_22k	INFO	{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45}, 'data': {'training_files': './dataset/VCTK-Corpus/preprocessed_npz', 'validation_files': './dataset/VCTK-Corpus/preprocessed_npz', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 125, 'cleaned_text': True, 'aug_rate': 1.0, 'top_db': 20}, 'model': {'p_dropout': 0.1, 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False}, 'model_dir': './logs/bigvgan_22k'}
2022-07-01 20:25:10,034	bigvgan_22k	WARNING	git hash values are different. a8aabb3f(saved) != 059e599a(current)
2022-07-01 20:25:48,974	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/G_120000.pth' (iteration 55)
2022-07-01 20:25:49,335	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/D_120000.pth' (iteration 55)
2022-07-01 20:26:22,415	bigvgan_22k	INFO	{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 32, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45}, 'data': {'training_files': './dataset/VCTK-Corpus/preprocessed_npz', 'validation_files': './dataset/VCTK-Corpus/preprocessed_npz', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 125, 'cleaned_text': True, 'aug_rate': 1.0, 'top_db': 20}, 'model': {'p_dropout': 0.1, 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False}, 'model_dir': './logs/bigvgan_22k'}
2022-07-01 20:26:22,428	bigvgan_22k	WARNING	git hash values are different. a8aabb3f(saved) != 059e599a(current)
2022-07-01 20:27:01,020	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/G_120000.pth' (iteration 55)
2022-07-01 20:27:01,323	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/D_120000.pth' (iteration 55)
2022-07-01 20:28:17,110	bigvgan_22k	INFO	{'train': {'log_interval': 200, 'eval_interval': 5000, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 16, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 8192, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45}, 'data': {'training_files': './dataset/VCTK-Corpus/preprocessed_npz', 'validation_files': './dataset/VCTK-Corpus/preprocessed_npz', 'text_cleaners': ['english_cleaners2'], 'max_wav_value': 32768.0, 'sampling_rate': 22050, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 125, 'cleaned_text': True, 'aug_rate': 1.0, 'top_db': 20}, 'model': {'p_dropout': 0.1, 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False}, 'model_dir': './logs/bigvgan_22k'}
2022-07-01 20:28:17,123	bigvgan_22k	WARNING	git hash values are different. a8aabb3f(saved) != 059e599a(current)
2022-07-01 20:28:55,551	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/G_120000.pth' (iteration 55)
2022-07-01 20:28:55,831	bigvgan_22k	INFO	Loaded checkpoint './logs/bigvgan_22k/D_120000.pth' (iteration 55)
2022-07-01 20:31:31,316	bigvgan_22k	INFO	Train Epoch: 55 [13%]
2022-07-01 20:31:31,317	bigvgan_22k	INFO	[3.5243208408355713, 2.7117111682891846, 6.130326271057129, 12.96578598022461, 69400, 0.00019862963039358455]
2022-07-01 20:34:15,913	bigvgan_22k	INFO	Train Epoch: 55 [29%]
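For what it's worth, the two runs are hard to compare line-by-line: per the config dumps above, the 1-GPU run uses batch_size 32 while the last 2-GPU launch uses batch_size 16 per process. A rough sketch for turning the timestamps into samples/sec instead (assuming batch_size in the config is per process and that log_interval = 200 steps separate consecutive "Train Epoch" lines):

from datetime import datetime

# Hedged helper: convert two "Train Epoch" log timestamps into throughput.
def samples_per_sec(t0, t1, steps=200, global_batch=32):
    fmt = '%Y-%m-%d %H:%M:%S,%f'
    dt = (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()
    return steps * global_batch / dt

# 1-GPU run: batch_size 32 on one GPU.
print(samples_per_sec('2022-06-29 11:54:05,369', '2022-06-29 11:56:36,490'))
# 2-GPU run: batch_size 16 per process x 2 GPUs = 32 samples per step.
print(samples_per_sec('2022-07-01 20:31:31,316', '2022-07-01 20:34:15,913'))

On these numbers the two runs come out to roughly the same samples/sec, which would point to no effective scaling rather than an outright per-sample slowdown.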

Also, the initial multi-GPU run produces the following in the log:

/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/torch/functional.py:572: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/SpectralOps.cpp:659.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/torch/functional.py:572: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/SpectralOps.cpp:659.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/torch/functional.py:572: UserWarning: stft will soon require the return_complex parameter be given for real inputs, and will further require that return_complex=True in a future PyTorch release. (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/SpectralOps.cpp:659.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
INFO:bigvgan_22k_new:Saving model and optimizer state at iteration 1 to ./logs/bigvgan_22k_new/G_0.pth
INFO:bigvgan_22k_new:Saving model and optimizer state at iteration 1 to ./logs/bigvgan_22k_new/D_0.pth
INFO:root:Reducer buckets have been rebuilt in this iteration.
INFO:root:Reducer buckets have been rebuilt in this iteration.
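Not certain, but the BrokenPipeError above originates in a multiprocessing queue feeder thread (multiprocessing/queues.py, _feed), which usually means a child process, typically a DataLoader worker, lost the process reading from its queue. One way to rule the data pipeline out is to rerun with worker processes disabled; train_dataset and collate_fn below are hypothetical stand-ins for whatever the training script actually constructs:

from torch.utils.data import DataLoader

# Debugging sketch: num_workers=0 removes the worker processes and their
# feeder queues entirely, so a surviving BrokenPipeError would have to
# come from somewhere else (e.g. DDP process teardown).
loader = DataLoader(
    train_dataset,          # hypothetical: the script's dataset object
    batch_size=16,
    num_workers=0,          # no workers -> no multiprocessing queues
    collate_fn=collate_fn,  # hypothetical: the script's collate function
    pin_memory=True,
)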

Results?

First of all, thank you for the project. I think it is really useful, especially since the official NVIDIA implementation has not been released yet!

Did you manage to train the model to satisfactory quality and replicate the results from the paper? The one sample provided seems to be from very early in training (30k steps).

Would it be possible to include some samples or a pretrained model to see how the implementation works in practice?

If you have not trained it yet, I can try training it for 3 days on 4x 3090 GPUs and see what happens.
Thank you!

inference not working

Hi,
I've trained a model for 23,000 steps. Using TensorBoard, I can hear the eval improvements at each checkpoint, and it sounds great!
The problem is that I am not able to generate usable wav files at inference time. The script I've created writes wav files, but they are all static and noise. It would be great if there were an inference script in the repo :)

Here is my config

{
  "train": {
    "log_interval": 10,
    "eval_interval": 100,
    "seed": 1234,
    "epochs": 2900,
    "learning_rate": 2e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size": 16,
    "fp16_run": true,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45

  },
  "data": {
    "training_files": "./dataset/custom/preprocessed_npz",
    "validation_files":"./dataset/custom/preprocessed_npz",
    "text_cleaners":["english_cleaners2"],
    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "add_blank": true,
    "n_speakers": 1,
    "cleaned_text": true,
    "aug_rate": 1.0,
    "top_db": 20
  },
  "model": {
    "p_dropout": 0.1,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "upsample_rates": [8,8,2,2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16,16,4,4],
    "use_spectral_norm": false

  }
}

Here is a custom script I'm trying to put together.

import os
from glob import glob

import numpy as np
import torch
import tqdm
from scipy.io.wavfile import write

import utils
from mel_processing import mel_spectrogram_torch
from models_bigvgan import Generator

def main():
    print('Initializing Inference Process..')
    h = utils.get_hparams_from_dir('logs/bigvgan')
    torch.cuda.manual_seed(1234)
    device = torch.device('cuda')
    mel_channels = h.data.n_mel_channels

    # NOTE: training may instead have built the Generator with
    # linear-spectrogram input channels, i.e.
    # Generator(h.data.filter_length // 2 + 1, ...);
    # the input channel count here must match the checkpoint.
    generator = Generator(mel_channels,
            h.model.resblock_kernel_sizes, h.model.resblock_dilation_sizes,
            h.model.upsample_rates, h.model.upsample_initial_channel,
            h.model.upsample_kernel_sizes).to(device)
    utils.load_checkpoint(utils.latest_checkpoint_path('logs/bigvgan/', 'G_*.pth'), generator)
    generator.eval()
    generator.remove_weight_norm()  # weight norm is only needed during training

    npz_paths = glob(os.path.join('dataset/custom/preprocessed_npz/p100/test', '*.npz'))
    print('data len:', len(npz_paths))
    print('Parameters:', sum(p.numel() for p in generator.parameters()))
    print(f'mel_channels {mel_channels}')

    for path in tqdm.tqdm(npz_paths, desc='synthesizing each utterance'):
        files = np.load(path)
        file_name = os.path.splitext(os.path.basename(path))[0]
        with torch.no_grad():
            # Normalize int16 samples to [-1, 1] before computing the mel.
            audio = torch.FloatTensor(files['audio']).to(device)
            audio = audio / h.data.max_wav_value

            mel = mel_spectrogram_torch(audio.unsqueeze(0), h.data.filter_length,
                    mel_channels, h.data.sampling_rate, h.data.hop_length,
                    h.data.win_length, h.data.mel_fmin, h.data.mel_fmax)
            audio = generator(mel).squeeze()

            # Peak-normalize and convert back to int16 before writing.
            audio = audio / torch.abs(audio).max() * 0.999 * h.data.max_wav_value
            audio = audio.cpu().numpy().astype('int16')
            write(os.path.join('inference/', f'generated_{file_name}.wav'),
                  h.data.sampling_rate, audio)

if __name__ == '__main__':
    main()
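If the output is pure static, one hedged hypothesis, suggested by the channel-count note in the script above and by the 513-vs-80-channel error in the "Test with MEL spectrogram" issue below, is that the checkpoint was trained on linear spectrograms rather than mels. A minimal sketch of that input path, assuming mel_processing also provides a VITS-style spectrogram_torch helper:

from mel_processing import spectrogram_torch  # assumption: VITS-style helper

# Hedged sketch, not a confirmed fix: build the Generator for
# filter_length // 2 + 1 = 513 linear-spectrogram bins and feed it a
# linear spectrogram instead of an 80-bin mel.
spec_channels = h.data.filter_length // 2 + 1
generator = Generator(spec_channels,
        h.model.resblock_kernel_sizes, h.model.resblock_dilation_sizes,
        h.model.upsample_rates, h.model.upsample_initial_channel,
        h.model.upsample_kernel_sizes).to(device)

spec = spectrogram_torch(audio.unsqueeze(0), h.data.filter_length,
        h.data.sampling_rate, h.data.hop_length, h.data.win_length)
audio_hat = generator(spec)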

correct eps value?

The configuration JSON contains an eps value of 1e-9.

In transformers.py there is:
def searchsorted(bin_locations, inputs, eps=1e-6):

Meanwhile, in another issue you say that spec must be computed as follows (with eps as 1e-6):
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
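For what it's worth, the two values plausibly serve different roles: the config's eps sits next to learning_rate and betas, so it is presumably the optimizer epsilon, while 1e-6 is a numerical floor that keeps sqrt() and searchsorted() away from zero. A sketch of that reading (build_optimizer and magnitude are illustrative names, not functions from this repo):

import torch

def build_optimizer(model, h):
    # Presumed use of the config value: AdamW's epsilon term.
    return torch.optim.AdamW(model.parameters(),
                             lr=h.train.learning_rate,
                             betas=h.train.betas,
                             eps=h.train.eps)  # 1e-9 from the JSON config

def magnitude(spec_ri):
    # 1e-6 here is a stability floor so sqrt() stays differentiable at 0,
    # matching the line quoted above; unrelated to the optimizer eps.
    return torch.sqrt(spec_ri.pow(2).sum(-1) + 1e-6)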

Add link in README to dataset used

Hey there, I checked this out, but when I went to run process.py, my VCTK directory tree looked different from yours. I downloaded it from here - the files are FLAC and the file structure is slightly different (more than one mic per recording, etc.), so it ends up throwing off the train/test splits.

I was able to update the scripts for my use case, but a pointer in the README may be nice for others coming into this project.

Which data download did you use?

How to reduce the parameters of the bigvgan model

Great work! This code reproduces the results of the original paper very well. However, the model has a lot of parameters, which makes it hard to train together with an acoustic model. Can you give me some advice on how to reduce the parameters of the BigVGAN model?
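Not an authoritative answer, but the capacity knob visible in the configs in this thread is upsample_initial_channel (512). A hedged sketch, halving the width; since conv parameter counts scale roughly with the product of input and output channels, this shrinks the generator to roughly a quarter of its size, at an untested cost in quality:

# Illustrative only: same constructor as elsewhere in the repo, with the
# initial channel width halved (512 -> 256). The input channel count must
# still match whatever the training pipeline feeds the model.
generator = Generator(h.data.n_mel_channels,
        h.model.resblock_kernel_sizes, h.model.resblock_dilation_sizes,
        h.model.upsample_rates,
        256,  # upsample_initial_channel halved from 512
        h.model.upsample_kernel_sizes)
print('params:', sum(p.numel() for p in generator.parameters()))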

Test with MEL spectrogram

Though you provide code in issue #1, it doesn't work.

Making this block change (along with eps 1e-9)

mel = spec_to_mel_torch(
        spec,
        hps.data.filter_length,
        hps.data.n_mel_channels,
        hps.data.sampling_rate,
        hps.data.mel_fmin,
        hps.data.mel_fmax)

mel_for_loss = spec_to_mel_torch(
        spec,
        hps.data.filter_length,
        hps.data.n_mel_channels,
        hps.data.sampling_rate,
        hps.data.mel_fmin,
        hps.data.mel_fmax_for_loss)

leads to this error:

RuntimeError: Given groups=1, weight of size [512, 513, 7], expected input[16, 80, 32] to have 513 channels, but got 80 channels instead
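The error message itself points at the cause: the Generator's first conv weight is [512, 513, 7], i.e. the model was built for 513-channel linear spectrograms (filter_length // 2 + 1), while spec_to_mel_torch produces 80 mel channels. A mel input would only work if the model were constructed (and retrained) with n_mel_channels inputs, roughly:

# Hedged sketch: rebuild the Generator for 80-bin mel input instead of
# 513-bin linear spectrograms; an existing linear-spec checkpoint will
# not load into this shape.
generator = Generator(hps.data.n_mel_channels,
        hps.model.resblock_kernel_sizes, hps.model.resblock_dilation_sizes,
        hps.model.upsample_rates, hps.model.upsample_initial_channel,
        hps.model.upsample_kernel_sizes)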

My vctk_bigvgan.json

{
  "train": {
    "log_interval": 200,
    "eval_interval": 1000,
    "seed": 1234,
    "epochs": 20000,
    "learning_rate": 1e-4,
    "betas": [0.8, 0.99],
    "eps": 1e-9,
    "batch_size":16,
    "fp16_run": true,
    "lr_decay": 0.999875,
    "segment_size": 8192,
    "init_lr_ratio": 1,
    "warmup_epochs": 0,
    "c_mel": 45

  },
  "data": {
    "training_files": "./dataset/VCTK-Corpus/preprocessed_npz",
    "validation_files":"./dataset/VCTK-Corpus/preprocessed_npz",
    "text_cleaners":["english_cleaners2"],
    "max_wav_value": 32768.0,
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 12000,
    "mel_fmax_for_loss": null,
    "add_blank": true,
    "n_speakers": 43,
    "cleaned_text": true,
    "aug_rate": 1.0,
    "top_db": 20
  },
  "model": {
    "p_dropout": 0.1,
    "resblock_kernel_sizes": [3,7,11],
    "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
    "upsample_rates": [8,8,2,2],
    "upsample_initial_channel": 512,
    "upsample_kernel_sizes": [16,16,4,4],
    "use_spectral_norm": false

  }
}

It'd be nice if it were actually possible to test with a mel spectrogram.
