Check out our related work:
- iSTFTNet (ICASSP 2022): Fast and lightweight neural vocoder using iSTFT
- iSTFTNet2 (Interspeech 2023): Faster and more lightweight iSTFTNet using 1D-2D CNN
- MISRNet (Interspeech 2022): Lightweight neural vocoder using multi-input single shared residual blocks
- AugCondD (ICASSP 2024): Augmentation-conditional discriminator for limited data
Abstract
In speech synthesis, a generative adversarial network (GAN), which trains a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such discriminators allow synthesized speech to closely approach real speech; however, their model size and computation time grow with the number of discriminators. As an alternative, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator based on the Wave-U-Net architecture. This discriminator is unique in that it can assess a waveform in a sample-wise manner at the same resolution as the input signal while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to closely match real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight when used in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight when used in VITS.
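To make the architecture described above more concrete, below is a minimal PyTorch sketch of a Wave-U-Net-style discriminator: a 1D convolutional encoder-decoder with skip connections that maps a raw waveform to a sample-wise realness score map at the input resolution. The channel widths, kernel sizes, and number of levels here are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only; hyperparameters are assumptions, not the paper's settings.
import torch
import torch.nn as nn


class WaveUNetDiscriminator(nn.Module):
    """Single discriminator producing one realness score per waveform sample."""

    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.input_conv = nn.Conv1d(1, channels[0], kernel_size=7, padding=3)

        # Encoder: strided convolutions; each level halves the temporal resolution.
        self.down = nn.ModuleList([
            nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        ])

        # Decoder: transposed convolutions restore the resolution; a 1x1 convolution
        # fuses the upsampled feature with the corresponding encoder (skip) feature.
        self.up = nn.ModuleList([
            nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[::-1][:-1], channels[::-1][1:])
        ])
        self.fuse = nn.ModuleList([
            nn.Conv1d(2 * c_out, c_out, kernel_size=1)
            for c_out in channels[::-1][1:]
        ])

        # Output head: one score per sample, same length as the input waveform.
        self.output_conv = nn.Conv1d(channels[0], 1, kernel_size=7, padding=3)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # x: (batch, 1, time); time is assumed divisible by 2**(len(channels) - 1)
        # so that encoder and decoder resolutions align exactly.
        h = self.act(self.input_conv(x))
        skips = []
        for down in self.down:
            skips.append(h)
            h = self.act(down(h))
        for up, fuse, skip in zip(self.up, self.fuse, reversed(skips)):
            h = self.act(up(h))
            h = self.act(fuse(torch.cat([h, skip], dim=1)))
        return self.output_conv(h)  # (batch, 1, time): sample-wise realness map


if __name__ == "__main__":
    d = WaveUNetDiscriminator()
    wave = torch.randn(2, 1, 8192)  # batch of 8192-sample waveforms
    print(d(wave).shape)            # torch.Size([2, 1, 8192])
```

Because the output has the same resolution as the input, this single discriminator can supply per-sample feedback to the generator, in contrast to an ensemble of discriminators that each judge the waveform at a reduced or transformed resolution.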
Audio Samples
I. Evaluation on neural vocoders
- Comparison among neural vocoders on LJSpeech
- Comparison among neural vocoders on VCTK
- Comparison among neural vocoders on JSUT
II. Evaluation on end-to-end text-to-speech
I. Evaluation on neural vocoders
i. Comparison among neural vocoders on LJSpeech
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
ii. Comparison among neural vocoders on VCTK
1. Seen speakers
|  | Sample 1 (p232) | Sample 2 (p260) | Sample 3 (p268) | Sample 4 (p347) | Sample 5 (p361) |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
2. Unseen speaker (speaker s5, held out from training)
|  | Sample 1 (s5) | Sample 2 (s5) | Sample 3 (s5) | Sample 4 (s5) | Sample 5 (s5) |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
iii. Comparison among neural vocoders on JSUT
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
II. Evaluation on end-to-end text-to-speech
i. Comparison among end-to-end text-to-speech systems
- Dataset: LJSpeech (single English speaker)
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| VITS [5] | | | | | |
| Wave-U-Net D | | | | | |
Citation
@inproceedings{kaneko2023waveunetd,
  title={{Wave-U-Net Discriminator}: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Seki, Shogo},
  booktitle={ICASSP},
  year={2023},
}
References
1. K. Ito and L. Johnson. The LJ Speech Dataset. 2017.
2. J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
3. J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. The Centre for Speech Technology Research, University of Edinburgh, 2017.
4. R. Sonobe, S. Takamichi, and H. Saruwatari. JSUT Corpus: Free Large-Scale Japanese Speech Corpus for End-to-End Speech Synthesis. arXiv preprint arXiv:1711.00354, 2017.
5. J. Kim, J. Kong, and J. Son. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML, 2021.