FasterVoiceGrad

Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation

Interspeech 2025

Paper

FasterVoiceGrad (proposed): Jointly distilling a diffusion model and a content encoder

Cf. FastVoiceGrad (previous): Distilling only a diffusion model

TL;DR

We propose FasterVoiceGrad, a novel one-step diffusion-based voice conversion (VC) model, obtained by jointly distilling a diffusion model and a content encoder through adversarial diffusion conversion distillation (ADCD).

Abstract

A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and a content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6 to 6.9 times faster on a GPU and 1.8 times faster on a CPU.
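To make the cost difference between iterative and one-step sampling concrete, the following toy sketch counts how many times a denoiser is evaluated in each case. Everything in it is an assumption made for illustration: `CountingDenoiser`, `iterative_sample`, `one_step_sample`, the placeholder update rule, and the choice of 30 steps are hypothetical and are not taken from VoiceGrad, FastVoiceGrad, or FasterVoiceGrad.

```python
# Toy illustration only: the "denoiser" is a stand-in function,
# not VoiceGrad, FastVoiceGrad, or FasterVoiceGrad.


class CountingDenoiser:
    """Hypothetical denoiser that records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # Placeholder update: shrink every value toward zero.
        return [v * (1.0 - 0.5 * t) for v in x]


def iterative_sample(denoiser, x, num_steps):
    """Call the denoiser once per step, as an iterative sampler would."""
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        x = denoiser(x, t)
    return x


def one_step_sample(denoiser, x):
    """Call the denoiser a single time, as a distilled one-step model would."""
    return denoiser(x, 1.0)


x0 = [0.3, -1.2, 0.8]  # arbitrary toy input

iterative = CountingDenoiser()
iterative_sample(iterative, x0, num_steps=30)  # 30 steps is an assumed value

one_step = CountingDenoiser()
one_step_sample(one_step, x0)

print(iterative.calls, one_step.calls)  # prints: 30 1
```

The sketch only counts denoiser evaluations. It does not model the content encoder, whose cost is the remaining bottleneck that the abstract says FasterVoiceGrad targets.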

Contents

  1. Component analysis (Section 4.2)
  2. Comparison with direct distillation (Section 4.2)
  3. Comparison with FastVoiceGrad (Section 4.3)
  4. Results on LibriTTS (Section 4.4)

Results

I. Component analysis (Section 4.2)
Source | Target | FastVoiceGrad+pφ | + Conversion | + Reconversion | + Inverse (Full)
Female → Female
Male → Male
Female → Male
Male → Female

II. Comparison with direct distillation (Section 4.2)
Source | Target | Direct Distillation (1 layer) | FasterVoiceGrad (1 layer) | Direct Distillation (3 layers) | FasterVoiceGrad (3 layers) | Direct Distillation (6 layers) | FasterVoiceGrad (6 layers)
Female → Female
Male → Male
Female → Male
Male → Female

III. Comparison with FastVoiceGrad (Section 4.3)
Source | Target | DiffVC-30 | FastVoiceGrad | FasterVoiceGrad
Female → Female
Male → Male
Female → Male
Male → Female

IV. Results on LibriTTS (Section 4.4)
Source | Target | FastVoiceGrad | FasterVoiceGrad
Female → Female
Male → Male
Female → Male
Male → Female

Citation

@inproceedings{kaneko2025fastervoicegrad,
  title={FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Kondo, Yuto},
  booktitle={Interspeech},
  year={2025},
}