MeanVoiceFlow

teaser — **Fig. 1.** Comparison of (a) instantaneous velocity used in conventional **flow matching** and (b) average velocity used in **mean flows**. (a) *Instantaneous velocity* (blue arrow) represents the tangent direction of the path for a single time step t. (b) *Average velocity* (orange arrow) aligns with the displacement between two time steps r and t. In MeanVoiceFlow, a *zero-input constraint* is imposed on the average velocity for a zero-input sample (green arrow) using a *structural margin reconstruction loss* to moderately guide learning.

TL;DR

We propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation.

Abstract

In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input--output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch.

Analysis of zero-input constraint (Table 1)
Analysis of conditional diffused-input training (Fig. 3)
Comparisons with prior models (Table 2)
Versatility analysis (Table 3)

Results

I. Analysis of zero-input constraint (Table 1)

(E): SSIM + margin + zero-input constraint (proposed configuration)

	(A)	(B)	(C)	(D)	(E)	(F)
Metric	—	L1	L2	SSIM	SSIM	SSIM
Margin	—	—	—	—	✓	✓
Input	—	Zero	Zero	Zero	Zero	All
Female → Female
Male → Male
Female → Male
Male → Female

Incorporating the reconstruction loss with (1) structural comparison (SSIM), (2) margin-based relaxation, and (3) selective application (zero-input constraint) is crucial for achieving high-fidelity synthesis.

II. Analysis of conditional diffused-input training (Fig. 3)

w/: With conditional diffused-input training (proposed training)
w/o: Without conditional diffused-input training

Note: w/o may produce buzzy sounds at t' = 0.5 and t' = 0.0; please adjust the volume accordingly.

	Source	Target	w/o	w/	w/o	w/	w/o	w/
t'	–	–	1.0	1.0	0.5	0.5	0.0	0.0
Female → Female
Male → Male
Female → Male
Male → Female

Conditional diffused-input training (w/) enhances both robustness to the mixing ratio t' and peak performance.

III. Comparisons with prior models (Table 2)

MVF: MeanVoiceFlow (proposed model)

	DiffVC-30	VG-DM-1	VG-DM-30	VG-FM-1	VG-FM-30	FVG	FVG+	MVF
NFE	30	1	30	1	30	1	1	1
From scratch	✓	✓	✓	✓	✓	—	—	✓
Female → Female
Male → Male
Female → Male
Male → Female

MeanVoiceFlow (MVF) outperforms one-step models trained from scratch, such as VoiceGrad-DM-1 (VG-DM-1) and VoiceGrad-FM-1 (VG-FM-1), and achieves performance comparable to that of multi-step models, including VoiceGrad-DM-30 (VG-DM-30) and VoiceGrad-FM-30 (VG-FM-30), as well as one-step models enhanced by distillation and adversarial training, such as FastVoiceGrad (FVG) and FastVoiceGrad+ (FVG+).

IV. Versatility analysis (Table 3)

MVF: MeanVoiceFlow (proposed model)

	VG-DM-1	VG-DM-30	VG-FM-1	VG-FM-30	MVF
NFE	1	30	1	30	1
Female → Female
Male → Male
Female → Male
Male → Female

On another dataset, MeanVoiceFlow (MVF) outperforms one-step models trained from scratch, such as VoiceGrad-DM-1 (VG-DM-1) and VoiceGrad-FM-1 (VG-FM-1), and achieves performance comparable to that of multi-step models, including VoiceGrad-DM-30 (VG-DM-30) and VoiceGrad-FM-30 (VG-FM-30).

Citation

@inproceedings{kaneko2026meanvoiceflow,
  title={MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Kondo, Yuto},
  booktitle={ICASSP},
  year={2026},
}

Related work

VoiceGrad: Multi-step diffusion-based VC model
FastVoiceGrad: One-step diffusion-based VC model
FasterVoiceGrad: Faster one-step diffusion-based VC model
VPFD: Enables faster and more efficient adversarial training
LatentVoiceGrad: Latent diffusion/flow-matching-based VC model