rishikksh20 / fre-gan-pytorch

Fre-GAN: Adversarial Frequency-consistent Audio Synthesis

License: MIT License

Python 100.00%

vocoder tts text-to-speech speech-synthesis speech

fre-gan-pytorch's Introduction

Fre-GAN Vocoder

Fre-GAN: Adversarial Frequency-consistent Audio Synthesis

Training:

python train.py --config config.json

Citation:

@misc{kim2021fregan,
      title={Fre-GAN: Adversarial Frequency-consistent Audio Synthesis}, 
      author={Ji-Hoon Kim and Sang-Hoon Lee and Ji-Hyun Lee and Seong-Whan Lee},
      year={2021},
      eprint={2106.02297},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Note

For a more complete, end-to-end voice cloning or text-to-speech (TTS) toolbox, please visit Deepsync Technologies.

fre-gan-pytorch's Issues

How do I use this with my tacotron2 model?

About remove_weight_norm

Fre-GAN-pytorch/generator.py

Line 170 in 91d0e46

l.remove_weight_norm()

I think this line should be remove_weight_norm(l). Calling l.remove_weight_norm() raises AttributeError: 'Sequential' object has no attribute 'remove_weight_norm'.

Fre-GAN-pytorch/generator.py

Line 172 in 91d0e46

l.remove_weight_norm()

This should be remove_weight_norm(l[1]). The current form raises AttributeError: 'Sequential' object has no attribute 'remove_weight_norm'.
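
The AttributeError reported here can be reproduced outside the repository. The following is a minimal sketch, assuming a hypothetical block that wraps an activation and a weight-normalized Conv1d in nn.Sequential. The layer sizes and the name block are illustrative and are not taken from generator.py. It applies the function form torch.nn.utils.remove_weight_norm to the inner layer, as the second report suggests.

```python
import torch.nn as nn
from torch.nn.utils import weight_norm, remove_weight_norm

# Hypothetical stand-in for a generator sub-block (not copied from generator.py):
# an activation followed by a weight-normalized convolution.
block = nn.Sequential(
    nn.LeakyReLU(0.1),
    weight_norm(nn.Conv1d(4, 4, kernel_size=3, padding=1)),
)

# nn.Sequential does not define remove_weight_norm, so block.remove_weight_norm()
# would raise AttributeError.
has_method = hasattr(block, "remove_weight_norm")

# The weight-normalized layer is the child at index 1 of this block.
conv = block[1]
had_weight_g = hasattr(conv, "weight_g")      # present while weight norm is active
remove_weight_norm(conv)                       # function form, applied to the layer
has_weight_g_after = hasattr(conv, "weight_g")  # gone once weight norm is removed

print(has_method, had_weight_g, has_weight_g_after)
```

Which index holds the weight-normalized layer depends on how the Sequential is built, so l[1] is only correct if the convolution is the second child, as it is in this sketch.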

Training multi-speaker

Have you trained a multi-speaker Fre-GAN? What is a good config for a multi-speaker dataset?

Pre-trained model

Is there a pre-trained model available? It would save many hours of initial training. Many thanks.

Output quality difference between regular and fine tuned training

How large is the difference in output quality between fine-tuned and regular training?

Hi! How does this work compare with UnivNet, for which you have already implemented code: https://github.com/rishikksh20/UnivNet-pytorch
The UnivNet paper is a little newer, but as far as I know its authors are more concerned with how well the model generalizes to unseen speakers, while this work focuses on overall quality (especially at high frequencies).
Could you elaborate?

Do nn upsample before mel condition

In the generator code, at line 137:

if i >= self.cond_level:
    mel = self.cond_up[i - self.cond_level](mel)
    x += mel
if i > self.cond_level:
    if output is None:
        output = self.res_output[i - self.cond_level - 1](x)
    else:
        output = self.res_output[i - self.cond_level - 1](output)

In the code, the input to the nn upsample is the mel condition plus the resblock output.

In the paper, however, the input to the nn upsample is only the resblock output or the previous nn upsample output.

So, is the following order more reasonable?

if i > self.cond_level:
    if output is None:
        output = self.res_output[i - self.cond_level - 1](x)
    else:
        output = self.res_output[i - self.cond_level - 1](output)
if i >= self.cond_level:
    mel = self.cond_up[i - self.cond_level](mel)
    x += mel

Inconsistency with paper

https://arxiv.org/pdf/2106.02297.pdf

In section 2.3
"After each level of DWT, all the frequency sub-bands are channel-wise concatenated and passed to convolutional layers"

Fre-GAN-pytorch/discriminator.py

Lines 242 to 246 in 91d0e46

if i == 0:
    x = torch.cat([x, x_d1], dim=2)
if i == 1:
    x = torch.cat([x, x_d2], dim=2)
i = i + 1

You are concatenating along the length dimension, which produces an odd-looking tensor whose first half is audio features and whose second half is DWT features, so local waveform and DWT information can't mix properly.

Is there any reason for this? I feel very confused looking at this, but you've done it twice so I assume there's some reason for this.
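
To make the difference concrete, here is a small sketch that contrasts the two concatenation axes on a (batch, channels, time) tensor. The haar_dwt helper is a hypothetical one-level Haar transform written only for this illustration. It is not the DWT module used in discriminator.py, and the tensor sizes are arbitrary.

```python
import torch

def haar_dwt(x):
    """One-level Haar DWT along the time axis of a (batch, channels, time) tensor.

    Hypothetical helper for illustration; assumes an even number of time steps.
    """
    even, odd = x[:, :, 0::2], x[:, :, 1::2]
    low = (even + odd) / 2 ** 0.5   # approximation (low-frequency) sub-band
    high = (even - odd) / 2 ** 0.5  # detail (high-frequency) sub-band
    return low, high

x = torch.randn(2, 1, 16)           # (batch, channels, time)
low, high = haar_dwt(x)             # each sub-band has shape (2, 1, 8)

# Channel-wise concatenation, as described in Section 2.3 of the paper:
# both sub-bands share each time step.
channel_cat = torch.cat([low, high], dim=1)

# Length-wise concatenation, as in the quoted lines (dim=2):
# the sub-bands are laid end to end along time.
length_cat = torch.cat([low, high], dim=2)

print(tuple(channel_cat.shape), tuple(length_cat.shape))
```

With dim=1, a convolution over the time axis sees both sub-bands at every position. With dim=2, the two sub-bands occupy separate halves of the time axis, so values that describe the same time region sit eight positions apart in this example.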

Compare with HiFi-GAN

Hello, this is great work. When the input is a predicted mel spectrum, which performs better, Fre-GAN or HiFi-GAN?
Thanks in advance.
