Wave-U-Net Discriminator

Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

ICASSP 2023

Paper

Figure 1. Overview of GAN training with a Wave-U-Net discriminator. A Wave-U-Net discriminator is unique in that it assesses a waveform in a sample-wise manner at the same resolution as the input signal while extracting multilevel features via an encoder and decoder with skip connections.

Note

Check out our related work: iSTFTNet / iSTFTNet2 / MISRNet

Abstract

In speech synthesis, a generative adversarial network (GAN), which trains a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such discriminators allow synthesized speech to adequately approach real speech; however, their model size and computation time grow with the number of discriminators. As an alternative, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator with a Wave-U-Net architecture. This discriminator is unique in that it assesses a waveform in a sample-wise manner at the same resolution as the input signal while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to be closely matched to the real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models can achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight when used in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight when used in VITS.
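
The architecture described above maps naturally onto standard 1D convolutions. The following is a minimal PyTorch sketch of a Wave-U-Net-style discriminator: a strided-convolution encoder, a transposed-convolution decoder with skip connections, and a final 1x1 convolution that outputs one real/fake score per input sample. The channel sizes, kernel size, and number of levels are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn


class WaveUNetDiscriminator(nn.Module):
    """1D U-Net that maps a waveform (B, 1, T) to a sample-wise
    real/fake score map (B, 1, T) at the input resolution.
    Assumes T is divisible by 2 ** len(channels) so that the skip
    connections line up exactly. Hyperparameters are illustrative."""

    def __init__(self, channels=(16, 32, 64, 128), kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        self.encoders = nn.ModuleList()
        in_ch = 1
        for ch in channels:
            # Each encoder level halves the temporal resolution.
            self.encoders.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel_size, stride=2, padding=pad),
                nn.LeakyReLU(0.2),
            ))
            in_ch = ch
        self.bottleneck = nn.Sequential(
            nn.Conv1d(in_ch, in_ch, kernel_size, padding=pad),
            nn.LeakyReLU(0.2),
        )
        self.decoders = nn.ModuleList()
        for ch in reversed(channels):
            # Each decoder level doubles the resolution after concatenating
            # the skip connection from the matching encoder level.
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose1d(in_ch + ch, ch, kernel_size, stride=2,
                                   padding=pad, output_padding=1),
                nn.LeakyReLU(0.2),
            ))
            in_ch = ch
        # 1x1 convolution: one score per input sample.
        self.out = nn.Conv1d(in_ch, 1, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        x = self.bottleneck(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(torch.cat([x, skip], dim=1))
        return self.out(x)  # (B, 1, T)


if __name__ == "__main__":
    d = WaveUNetDiscriminator()
    wave = torch.randn(2, 1, 8192)  # a batch of waveform segments
    print(d(wave).shape)            # torch.Size([2, 1, 8192])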


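For completeness, the sketch below shows how such a sample-wise score map could drive the least-squares GAN objective used by HiFi-GAN and VITS, reusing the WaveUNetDiscriminator sketched above. Averaging the per-sample scores over every time step is an assumption made here for illustration, not necessarily the exact formulation in the paper, and feature-matching and reconstruction losses are omitted.

import torch
import torch.nn.functional as F


def discriminator_loss(d, real_wave, fake_wave):
    # Push every sample of real speech toward 1 and every sample of
    # generated speech toward 0 (least-squares GAN objective).
    real_score = d(real_wave)
    fake_score = d(fake_wave.detach())
    return (F.mse_loss(real_score, torch.ones_like(real_score))
            + F.mse_loss(fake_score, torch.zeros_like(fake_score)))


def generator_adversarial_loss(d, fake_wave):
    # The generator tries to make every sample of the synthesized
    # waveform look real (score close to 1).
    fake_score = d(fake_wave)
    return F.mse_loss(fake_score, torch.ones_like(fake_score))
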
Audio Samples

I. Evaluation on neural vocoders
  i. Comparison among neural vocoders on LJSpeech
  ii. Comparison among neural vocoders on VCTK
  iii. Comparison among neural vocoders on JSUT
II. Evaluation on end-to-end text-to-speech
  i. Comparison among end-to-end text-to-speech systems

I. Evaluation on neural vocoders

i. Comparison among neural vocoders on LJSpeech
  • Dataset: LJSpeech [1] (single English speaker)
[Audio samples 1-5: Ground truth / HiFi-GAN [2] / Wave-U-Net D]
ii. Comparison among neural vocoders on VCTK
  • Dataset: VCTK [3] (multiple English speakers)
1. Seen speakers
[Audio samples 1-5 (speakers p232, p260, p268, p347, p361): Ground truth / HiFi-GAN [2] / Wave-U-Net D]
2. Unseen speaker (speaker s5, which was not used for training)
[Audio samples 1-5 (speaker s5): Ground truth / HiFi-GAN [2] / Wave-U-Net D]
iii. Comparison among neural vocoders on JSUT
  • Dataset: JSUT [4] (single Japanese speaker)
[Audio samples 1-5: Ground truth / HiFi-GAN [2] / Wave-U-Net D]

II. Evaluation on end-to-end text-to-speech

i. Comparison among end-to-end text-to-speech systems
  • Dataset: LJSpeech [1] (single English speaker)
[Audio samples 1-5: Ground truth / VITS [5] / Wave-U-Net D]

Citation

@inproceedings{kaneko2023waveunetd,
  title={{Wave-U-Net Discriminator}: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Seki, Shogo},
  booktitle={ICASSP},
  year={2023},
}

References

  1. K. Ito, L. Johnson. The LJ Speech Dataset. 2017.
  2. J. Kong, J. Kim, J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
  3. J. Yamagishi, C. Veaux, K. MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. The Centre for Speech Technology Research, University of Edinburgh, 2017.
  4. R. Sonobe, S. Takamichi, H. Saruwatari. JSUT Corpus: Free Large-Scale Japanese Speech Corpus for End-to-End Speech Synthesis. arXiv preprint arXiv:1711.00354, 2017.
  5. J. Kim, J. Kong, J. Son. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML, 2021.