VPFD

Figure: **Vocoder-projected feature discriminator (proposed)**

TL;DR

We propose a vocoder-projected feature discriminator (VPFD), which leverages vocoder features to facilitate faster and more efficient adversarial training.

Abstract

In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.
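To make the upsampling overhead concrete, the sketch below tracks how the feature shape grows through a vocoder's upsampling stages. It is illustrative only. The layout is an assumption modeled on a HiFi-GAN-V1-style generator (four stages that upsample time by 8, 8, 2, and 2 while halving the channel count from 512), which this page does not specify, and the names `UPSAMPLE_RATES`, `INITIAL_CHANNELS`, and `feature_shape` are hypothetical. Here `num_steps` plays the role of the subscript in VPFD₀ to VPFD₄.

```python
# Assumed HiFi-GAN-V1-style generator layout (not stated on this page):
# four upsampling stages with time factors 8, 8, 2, 2, each halving the
# channel count, starting from 512 channels at the mel-frame rate.
UPSAMPLE_RATES = (8, 8, 2, 2)
INITIAL_CHANNELS = 512


def feature_shape(num_frames, num_steps):
    """Return (channels, length) of the feature after `num_steps` upsampling stages."""
    if not 0 <= num_steps <= len(UPSAMPLE_RATES):
        raise ValueError("num_steps must be between 0 and the number of stages")
    channels, length = INITIAL_CHANNELS, num_frames
    for rate in UPSAMPLE_RATES[:num_steps]:
        channels //= 2   # each stage halves the channels
        length *= rate   # and stretches the time axis
    return channels, length


if __name__ == "__main__":
    # Shapes a discriminator would receive for a 128-frame mel segment.
    for k in range(len(UPSAMPLE_RATES) + 1):
        print(k, feature_shape(128, k))
```

Under these assumed rates, one upsampling step leaves a sequence 32 times shorter than the full four-step output, so a discriminator operating on that feature processes far fewer time steps than one operating at waveform resolution. The sketch only shows sequence-length growth; it does not reproduce the reported 9.6 and 11.4 times savings, which also depend on the discriminator architectures.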

Table 1: Comparison of performance with varying number of upsampling steps
Table 2: Analysis of the importance of pretraining and freezing the vocoder feature extractor
Table 3: Comparison with other training acceleration techniques
Table 4: Subjective evaluations
Table 5: Results on LibriTTS dataset

Results

Table 1: Comparison of performance with varying number of upsampling steps

	Source	Target	FVG	FVG+VPFD₀	FVG+VPFD₁	FVG+VPFD₂	FVG+VPFD₃	FVG+VPFD₄
Female → Female
Male → Male
Female → Male
Male → Female

Table 2: Analysis of the importance of pretraining and freezing the vocoder feature extractor

	FVG+VPFD₁	FVG+VPFD₁	FVG+VPFD₁
Pretrained	✓		✓
Frozen		✓	✓
Female → Female
Male → Male
Female → Male
Male → Female

Table 3: Comparison with other training acceleration techniques

	Source	Target	FVG	FVG_early	FVG w/o MRD	FVG w/o MPD	FVG+MelD_small	FVG+MelD_large	FVG+VPFD₁
Female → Female
Male → Male
Female → Male
Male → Female

Table 4: Subjective evaluations

	Source	Target	DiffVC-30	FVG	FVG+VPFD₁
Female → Female
Male → Male
Female → Male
Male → Female

Table 5: Results on LibriTTS dataset

	Source	Target	FVG	FVG+VPFD₁
Female → Female
Male → Male
Female → Male
Male → Female

Citation

@inproceedings{kaneko2025vpfd,
  title={Vocoder-Projected Feature Discriminator},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Kondo, Yuto},
  booktitle={Interspeech},
  year={2025},
}

Our related work

VoiceGrad: Multi-step diffusion-based VC model
FastVoiceGrad: One-step diffusion-based VC model (used as a baseline in this work)
FasterVoiceGrad: Faster one-step diffusion-based VC model