Papers
- Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Yuto Kondo, "LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models," arXiv:2509.08379 [cs.SD], Sep. 2025.
- Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Yuto Kondo, "LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models," submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing, 2025.
VoiceGrad and LatentVoiceGrad
VoiceGrad [1][2] is a nonparallel voice conversion (VC) technique that converts mel-spectrograms from a source to a target speaker using a denoising diffusion probabilistic model (DPM) [3]. The score network is conditioned on a speaker embedding, a timestep, and a phoneme embedding sequence, and is trained on speech samples from many speakers. Once trained, the score network performs VC by iteratively refining an input mel-spectrogram until it resembles the target speaker's. The original VoiceGrad used a one-hot vector as the speaker embedding, limiting conversion to the voices of speakers seen during training. Here, we instead use a speaker embedding produced by a pre-trained speaker encoder from a reference utterance, enabling zero-shot (i.e., any-to-any) conversion. LatentVoiceGrad improves on VoiceGrad by applying reverse diffusion not to mel-spectrograms directly but to their bottleneck features, obtained with an adversarially trained autoencoder. In addition, a flow-matching (FM) model [4] is introduced as an alternative to the DPM to further speed up conversion. This yields four variants, depending on whether the mel-spectrogram or the autoencoder bottleneck is converted and whether the DPM or the FM model serves as the underlying generative model. Below, these are referred to as VoiceGrad-DPM, LatentVoiceGrad-DPM, VoiceGrad-FM, and LatentVoiceGrad-FM.
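To make the conversion procedure concrete, here is a minimal sketch of the inference loop. All names (`score_net`, `speaker_encoder`, `phoneme_encoder`, the noise schedule, and the timestep choices) are illustrative placeholders following standard DDPM conventions, not the actual interface of our implementation. The key idea shared by all four variants is that conversion starts from a partially noised version of the source features (or their autoencoder bottleneck, for the LatentVoiceGrad variants), so linguistic content is retained while the speaker conditioning steers the reverse process toward the target voice.

```python
# Hedged sketch of zero-shot VC via reverse diffusion. All module and
# variable names here are hypothetical placeholders, not a released API.
import torch

@torch.no_grad()
def convert_dpm(mel_src, ref_wav, score_net, speaker_encoder, phoneme_encoder,
                alphas_cumprod, t_start):
    """Convert a source mel-spectrogram toward a reference speaker's voice.

    mel_src:        (1, n_mels, n_frames) source mel-spectrogram
    ref_wav:        reference utterance, used only for the speaker embedding
    alphas_cumprod: cumulative noise schedule of the trained DPM
    t_start:        intermediate timestep at which the source is injected;
                    smaller values keep more of the source content
    """
    spk = speaker_encoder(ref_wav)        # target speaker embedding (zero-shot)
    phn = phoneme_encoder(mel_src)        # phoneme embedding sequence (content)

    # Diffuse the *source* mel to timestep t_start instead of starting from
    # pure noise, so the linguistic content is preserved.
    a_bar = alphas_cumprod[t_start]
    x = a_bar.sqrt() * mel_src + (1 - a_bar).sqrt() * torch.randn_like(mel_src)

    # Standard DDPM-style reverse updates, conditioned on the target speaker.
    for t in reversed(range(t_start)):
        eps = score_net(x, t, spk, phn)   # predicted noise at timestep t
        a_bar_t = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        alpha_t = a_bar_t / a_bar_prev
        mean = (x - (1 - alpha_t) / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
        noise = torch.randn_like(x) if t > 0 else 0.0
        x = mean + ((1 - alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)).sqrt() * noise
    return x  # converted mel-spectrogram; vocode to a waveform afterwards
```

The FM variants replace the stochastic reverse updates with a few deterministic Euler steps along a learned velocity field, which is what makes them faster. Again, `velocity_net` and the step settings below are assumptions for illustration:

```python
@torch.no_grad()
def convert_fm(mel_src, ref_wav, velocity_net, speaker_encoder, phoneme_encoder,
               n_steps=10, tau_start=0.5):
    """Flow-matching counterpart: integrate a learned velocity field from an
    intermediate time tau_start to 1 with a few Euler steps."""
    spk = speaker_encoder(ref_wav)
    phn = phoneme_encoder(mel_src)
    # Interpolate source and noise to re-enter the probability path midway.
    x = tau_start * mel_src + (1 - tau_start) * torch.randn_like(mel_src)
    dt = (1.0 - tau_start) / n_steps
    tau = tau_start
    for _ in range(n_steps):
        x = x + dt * velocity_net(x, tau, spk, phn)  # Euler step along the flow
        tau += dt
    return x
```

For the LatentVoiceGrad variants, `mel_src` would be replaced by the autoencoder's bottleneck features, with the decoder applied to the result before vocoding.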
Links to related pages
Please also refer to the following websites.
- Previous version of VoiceGrad
- Links to my other work
Audio examples
Speaker identity conversion
Here are audio examples of speaker identity conversion on the CSTR VCTK Corpus (version 0.92) [5], which contains speech from 110 English speakers with various accents. To simulate a zero-shot any-to-any VC scenario, utterances from ten speakers (p238, p241, p243, p252, p261, p294, p334, p343, p360, and p362) were held out as test data, and those from the remaining speakers were used for training. Pairing each of the ten test speakers with the remaining nine yields 10 × 9 = 90 source-target combinations, the task being to convert the input speech into a voice similar to that of a reference utterance. For comparison, audio examples obtained with Diff-VC [6], a DPM-based VC method similar to ours, are also provided.
[Audio playback tables: one table per input speaker (p238, p241, p243, p252, p261, p294, p334, p343, p360, p362). Each table pairs that input with the nine remaining test speakers as references and provides the input, the reference, and the converted samples under five conditions: Diff-VC, VoiceGrad (DPM and FM), and LatentVoiceGrad (DPM and FM).]
References
[1] H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and S. Seki, "VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics," arXiv:2010.02977 [cs.SD], 2020.
[2] H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and S. Seki, "VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2213–2226, 2024.
[3] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Adv. Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 6840–6851.
[4] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," in Proc. International Conference on Learning Representations (ICLR), 2023, pp. 1–28.
[5] CSTR VCTK Corpus (version 0.92), https://datashare.ed.ac.uk/handle/10283/3443
[6] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, "Diffusion-based voice conversion with fast maximum likelihood sampling scheme," in Proc. International Conference on Learning Representations (ICLR), 2022, pp. 1–23.