FasterVoiceGrad (proposed): jointly distills a diffusion model and a content encoder.
Cf. FastVoiceGrad (previous): distills only the diffusion model.
TL;DR
We propose FasterVoiceGrad, a novel one-step diffusion-based voice conversion (VC) model, obtained by jointly distilling a diffusion model and a content encoder through adversarial diffusion conversion distillation (ADCD).
Abstract
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle speaker identity from content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and a content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed directly in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6--6.9 times faster on a GPU and 1.8 times faster on a CPU.
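To make the one-step conversion pipeline described above concrete, here is a minimal NumPy sketch of the inference flow: a (distilled) content encoder extracts speaker-independent content from the source mel-spectrogram, the source is noised once, and a single denoiser pass produces the converted mel conditioned on content and a target speaker embedding. All function names, shapes, and the toy linear layers are illustrative assumptions, not the authors' actual architecture or code.

```python
import numpy as np

# Illustrative sketch of one-step diffusion-based VC inference.
# Names and internals are hypothetical stand-ins, not the paper's model.

rng = np.random.default_rng(0)

def content_encoder(mel):
    # Stand-in for the distilled, lightweight content encoder:
    # a fixed linear projection from 80-dim mel to a 32-dim content space.
    W = np.full((80, 32), 0.1)
    return mel @ W  # shape: (frames, 32)

def one_step_denoiser(noisy_mel, content, spk_emb):
    # Stand-in for the one-step diffusion decoder: a single forward pass
    # maps the noised source mel plus conditions to the converted mel.
    Wc = np.full((32, 80), 0.05)
    return 0.5 * noisy_mel + content @ Wc + spk_emb  # shape: (frames, 80)

def convert(src_mel, spk_emb, sigma=0.5):
    # One-step conversion: noise the source once, denoise once.
    # No iterative sampling loop, which is the source of the speedup.
    noise = rng.standard_normal(src_mel.shape)
    noisy = src_mel + sigma * noise      # single forward-diffusion step
    content = content_encoder(src_mel)   # speaker-independent content
    return one_step_denoiser(noisy, content, spk_emb)

src_mel = rng.standard_normal((100, 80))  # 100 frames of 80-dim mel
spk_emb = rng.standard_normal(80)         # target speaker embedding
out = convert(src_mel, spk_emb)
print(out.shape)  # (100, 80)
```

The key structural point the sketch captures is that both networks run exactly once per utterance, so shrinking the content encoder (as ADCD's joint distillation does) directly reduces end-to-end conversion latency.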