Related projects:
iSTFTNet (ICASSP 2022): Fast and lightweight neural vocoder using iSTFT
iSTFTNet2 (Interspeech 2023): Faster and more lightweight iSTFTNet using 1D-2D CNN
MISRNet (Interspeech 2022): Lightweight neural vocoder using multi-input single shared residual blocks
WaveUNetD (ICASSP 2023): Fast and lightweight discriminator using Wave-U-Net
Abstract
A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data, incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Consequently, augmented speech (which can be extraordinary) may be judged as real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD, or ACD for short) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to its augmentation state without inhibiting the learning of the original, non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited-data conditions while achieving comparable speech quality under sufficient-data conditions.
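The core idea, conditioning the discriminator on the augmentation state, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the layer sizes, feature dimension, augmentation-state labels, and the function name `aug_cond_discriminator` are all illustrative assumptions (a real GAN vocoder discriminator operates on waveforms or spectrograms and is trained adversarially).

```python
import numpy as np

rng = np.random.default_rng(0)

N_AUG = 3      # hypothetical augmentation states: 0 = none, 1 = mixing, 2 = rate change
FEAT_DIM = 16  # toy speech-feature dimension (assumption for illustration)
HIDDEN = 32

# Randomly initialized toy parameters of a discriminator D(x, s).
W1 = rng.standard_normal((FEAT_DIM + N_AUG, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, 1)) * 0.1
b2 = np.zeros(1)

def aug_cond_discriminator(x, aug_state):
    """Score speech features x conditioned on the augmentation state.

    Conditioning is done here by concatenating a one-hot encoding of the
    augmentation state to the input, so the same network can apply a
    different criterion to augmented and non-augmented speech instead of
    treating augmented (possibly extraordinary) speech as real.
    """
    onehot = np.eye(N_AUG)[aug_state]
    h = np.tanh(np.concatenate([x, onehot]) @ W1 + b1)
    logit = float(h @ W2 + b2)
    return 1.0 / (1.0 + np.exp(-logit))  # probability that x is "real" given aug_state

x = rng.standard_normal(FEAT_DIM)
# The same speech generally receives a different score under each augmentation state.
scores = [aug_cond_discriminator(x, s) for s in range(N_AUG)]
print(scores)
```

The key design point is that the generator's target (the non-augmented distribution) is unchanged; only the discriminator's judgment is made augmentation-aware.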
II. Results with different network architectures (related to Table 2)
Dataset: LJSpeech
Model           Data  Samples 1-5
HiFiV2-mix      1%    (audio)
HiFiV2-ACD-mix  1%    (audio)
iSTFT-mix       1%    (audio)
iSTFT-ACD-mix   1%    (audio)
III. Results with different augmentation methods (related to Table 3)
Dataset: LJSpeech
Model         Data  Samples 1-5
HiFi-rate     1%    (audio)
HiFi-ACD-rate 1%    (audio)
IV. Results for different speakers (related to Table 4)
Dataset: LibriTTS, speaker ID 260
Model         ID   Samples 1-5
Ground truth  260  (audio)
HiFi-mix      260  (audio)
HiFi-ACD-mix  260  (audio)
Dataset: LibriTTS, speaker ID 1580
Model         ID    Samples 1-5
Ground truth  1580  (audio)
HiFi-mix      1580  (audio)
HiFi-ACD-mix  1580  (audio)
Citation
@inproceedings{kaneko2024augcondd,
title={Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator},
author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou},
booktitle={ICASSP},
year={2024},
}