Check out our related work:
- iSTFTNet (ICASSP 2022): Fast and lightweight neural vocoder using iSTFT
- iSTFTNet2 (Interspeech 2023): Faster and more lightweight iSTFTNet using 1D-2D CNN
- MISRNet (Interspeech 2022): Lightweight neural vocoder using multi-input single shared residual blocks
- AugCondD (ICASSP 2024): Augmentation-conditional discriminator for limited data
Abstract
In speech synthesis, a generative adversarial network (GAN), which trains a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such discriminators allow synthesized speech to closely approach real speech; however, their model size and computation time grow with the number of discriminators. As an alternative, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator based on the Wave-U-Net architecture. This discriminator is unique in that it can assess a waveform in a sample-wise manner at the same resolution as the input signal while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to closely match real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight when used in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight when used in VITS.
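To make the architecture described above more concrete, below is a minimal PyTorch sketch of a Wave-U-Net-style discriminator: a 1D convolutional encoder-decoder with skip connections that maps a raw waveform to a sample-wise realness score map at the input resolution. The channel widths, kernel sizes, and number of levels here are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only; hyperparameters are assumptions, not the paper's settings.
import torch
import torch.nn as nn


class WaveUNetDiscriminator(nn.Module):
    """Single discriminator producing one realness score per waveform sample."""

    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.input_conv = nn.Conv1d(1, channels[0], kernel_size=7, padding=3)

        # Encoder: strided convolutions; each level halves the temporal resolution.
        self.down = nn.ModuleList([
            nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        ])

        # Decoder: transposed convolutions restore the resolution; a 1x1 convolution
        # fuses the upsampled feature with the corresponding encoder (skip) feature.
        self.up = nn.ModuleList([
            nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
            for c_in, c_out in zip(channels[::-1][:-1], channels[::-1][1:])
        ])
        self.fuse = nn.ModuleList([
            nn.Conv1d(2 * c_out, c_out, kernel_size=1)
            for c_out in channels[::-1][1:]
        ])

        # Output head: one score per sample, same length as the input waveform.
        self.output_conv = nn.Conv1d(channels[0], 1, kernel_size=7, padding=3)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # x: (batch, 1, time); time is assumed divisible by 2**(len(channels) - 1)
        # so that encoder and decoder resolutions align exactly.
        h = self.act(self.input_conv(x))
        skips = []
        for down in self.down:
            skips.append(h)
            h = self.act(down(h))
        for up, fuse, skip in zip(self.up, self.fuse, reversed(skips)):
            h = self.act(up(h))
            h = self.act(fuse(torch.cat([h, skip], dim=1)))
        return self.output_conv(h)  # (batch, 1, time): sample-wise realness map


if __name__ == "__main__":
    d = WaveUNetDiscriminator()
    wave = torch.randn(2, 1, 8192)  # batch of 8192-sample waveforms
    print(d(wave).shape)            # torch.Size([2, 1, 8192])
```

Because the output has the same resolution as the input, this single discriminator can supply per-sample feedback to the generator, in contrast to an ensemble of discriminators that each judge the waveform at a reduced or transformed resolution.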
Audio Samples
I. Evaluation on neural vocoders
- Comparison among neural vocoders on LJSpeech
- Comparison among neural vocoders on VCTK
- Comparison among neural vocoders on JSUT
II. Evaluation on end-to-end text-to-speech
I. Evaluation on neural vocoders
i. Comparison among neural vocoders on LJSpeech
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
ii. Comparison among neural vocoders on VCTK
1. Seen speakers
|  | Sample 1 (p232) | Sample 2 (p260) | Sample 3 (p268) | Sample 4 (p347) | Sample 5 (p361) |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
2. Unseen speaker (speaker s5, held out from training)
|  | Sample 1 (s5) | Sample 2 (s5) | Sample 3 (s5) | Sample 4 (s5) | Sample 5 (s5) |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
iii. Comparison among neural vocoders on JSUT
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| HiFi-GAN [2] | | | | | |
| Wave-U-Net D | | | | | |
II. Evaluation on end-to-end text-to-speech
i. Comparison among end-to-end text-to-speech systems
- Dataset: LJSpeech (single English speaker)
|  | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Ground truth | | | | | |
| VITS [5] | | | | | |
| Wave-U-Net D | | | | | |
Citation
@inproceedings{kaneko2023waveunetd,
  title={{Wave-U-Net Discriminator}: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Seki, Shogo},
  booktitle={ICASSP},
  year={2023},
}
References
1. K. Ito and L. Johnson. The LJ Speech Dataset. 2017.
2. J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
3. J. Yamagishi, C. Veaux, and K. MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. The Centre for Speech Technology Research, University of Edinburgh, 2017.
4. R. Sonobe, S. Takamichi, and H. Saruwatari. JSUT Corpus: Free Large-Scale Japanese Speech Corpus for End-to-End Speech Synthesis. arXiv preprint arXiv:1711.00354, 2017.
5. J. Kim, J. Kong, and J. Son. Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML, 2021.