MeanVoiceFlow

One-step Nonparallel Voice Conversion with Mean Flows

ICASSP 2026

Fig. 1. Comparison of (a) instantaneous velocity used in conventional flow matching and (b) average velocity used in mean flows. (a) Instantaneous velocity (blue arrow) represents the tangent direction of the path for a single time step t. (b) Average velocity (orange arrow) aligns with the displacement between two time steps r and t. In MeanVoiceFlow, a zero-input constraint is imposed on the average velocity for a zero-input sample (green arrow) using a structural margin reconstruction loss to moderately guide learning.

TL;DR

We propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation.


Abstract

In voice conversion (VC) applications, diffusion and flow-matching models have achieved exceptional speech quality and speaker similarity. However, they are limited by slow conversion owing to their iterative inference. To address this limitation, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching, which uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training, in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch.
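
To make the training objective concrete, the following is a minimal PyTorch sketch of a mean-flow loss, not the authors' implementation: the network interface model(z, r, t, spk_emb), the speaker conditioning, the time-step sampling, and the plain squared-error loss are all assumptions.

import torch
from torch.func import jvp

def mean_flow_loss(model, x, spk_emb):
    """Minimal mean-flow training objective (sketch).

    model(z, r, t, spk_emb) is assumed to predict the average
    velocity u(z_t, r, t); x is a batch of mel-spectrograms (B, F, T).
    """
    b = x.shape[0]
    eps = torch.randn_like(x)
    # Sample two time steps with 0 <= r <= t <= 1 (sampling scheme assumed).
    t = torch.rand(b, device=x.device)
    r = torch.rand(b, device=x.device) * t
    t_, r_ = t.view(b, 1, 1), r.view(b, 1, 1)
    # Linear flow-matching path z_t = (1 - t) x + t eps and its
    # instantaneous velocity v = dz_t/dt.
    z_t = (1 - t_) * x + t_ * eps
    v = eps - x
    # Predicted average velocity and its total derivative along
    # (dz/dt, dr/dt, dt/dt) = (v, 0, 1), via a Jacobian-vector product.
    u, dudt = jvp(
        lambda z, r, t: model(z, r, t, spk_emb),
        (z_t, r, t),
        (v, torch.zeros_like(r), torch.ones_like(t)),
    )
    # Mean-flow identity: u = v - (t - r) du/dt, with a stop-gradient target.
    u_tgt = (v - (t_ - r_) * dudt).detach()
    return torch.mean((u - u_tgt) ** 2)

Under this path convention, one-step inference from pure noise z_1 would be x_hat = z_1 - u(z_1, 0, 1).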


Contents

  1. Analysis of zero-input constraint (Table 1)
  2. Analysis of conditional diffused-input training (Fig. 3)
  3. Comparisons with prior models (Table 2)
  4. Versatility analysis (Table 3)

Results

I. Analysis of zero-input constraint (Table 1)
  • (E): SSIM + margin + zero-input constraint (proposed configuration)
Source Target   (A)   (B)   (C)   (D)   (E)   (F)
Metric          –     L1    L2    SSIM  SSIM  SSIM
Margin          –     –     –     –     ✓     ✓
Input           –     Zero  Zero  Zero  Zero  All
Female → Female
Male → Male
Female → Male
Male → Female

Incorporating the reconstruction loss with (1) structural comparison (SSIM), (2) margin-based relaxation, and (3) selective application (zero-input constraint) is crucial for achieving high-fidelity synthesis.
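
As a sketch of how these three ingredients could fit together, the snippet below assumes a global (window-free) SSIM over mel-spectrograms, a hinge-style margin, and a mask selecting zero-input samples; the paper's exact formulation may differ.

import torch

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM computed globally per sample (B, F, T); simplified, no local windows."""
    mu_x, mu_y = x.mean(dim=(1, 2)), y.mean(dim=(1, 2))
    var_x, var_y = x.var(dim=(1, 2)), y.var(dim=(1, 2))
    cov = ((x - mu_x[:, None, None]) * (y - mu_y[:, None, None])).mean(dim=(1, 2))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def structural_margin_loss(x_hat, x_ref, is_zero_input, margin=0.1):
    """Margin-relaxed SSIM reconstruction loss, applied selectively (sketch).

    (1) structural comparison: SSIM-based distance instead of L1/L2;
    (2) margin-based relaxation: no penalty within the margin (value assumed);
    (3) selective application: only zero-input samples contribute
        (cf. Table 1, (E) vs. (F)).
    """
    d = 1.0 - global_ssim(x_hat, x_ref)            # structural distance
    hinge = torch.clamp(d - margin, min=0.0)       # margin-based relaxation
    return (hinge * is_zero_input.float()).mean()  # selective application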


II. Analysis of conditional diffused-input training (Fig. 3)
  • w/: With conditional diffused-input training (proposed training)
  • w/o: Without conditional diffused-input training

Note: w/o may produce buzzy sounds at t' = 0.5 and t' = 0.0; please adjust the volume accordingly.

Source Target   w/o   w/    w/o   w/    w/o   w/
t'              1.0   1.0   0.5   0.5   0.0   0.0
Female → Female
Male → Male
Female → Male
Male → Female

Conditional diffused-input training (w/) enhances both robustness to the mixing ratio t' and peak performance.
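
The mixing itself can be sketched as below, assuming the linear noise-source interpolation of standard flow matching (so t' = 1.0 is pure noise and t' = 0.0 the pure source) and a hypothetical model interface; during training t' would be sampled, while inference fixes it.

import torch

def diffused_input(x_src, t_prime):
    """Mixture of noise and source mel at ratio t' (convention assumed)."""
    eps = torch.randn_like(x_src)
    return (1.0 - t_prime) * x_src + t_prime * eps

def one_step_convert(model, x_src, spk_emb, t_prime=1.0):
    """One-step conversion from a diffused input (hypothetical interface)."""
    z = diffused_input(x_src, t_prime)
    b = z.shape[0]
    r = torch.zeros(b, device=z.device)             # integrate down to 0
    t = torch.full((b,), t_prime, device=z.device)  # ... starting from t'
    u = model(z, r, t, spk_emb)                     # average velocity
    return z - t_prime * u                          # single-step update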


III. Comparisons with prior models (Table 2)
  • MVF: MeanVoiceFlow (proposed model)
Source Target   DiffVC-30  VG-DM-1  VG-DM-30  VG-FM-1  VG-FM-30  FVG  FVG+  MVF
NFE             30         1        30        1        30        1    1     1
From scratch    ✓          ✓        ✓         ✓        ✓         ✗    ✗     ✓
Female → Female
Male → Male
Female → Male
Male → Female

MeanVoiceFlow (MVF) outperforms one-step models trained from scratch, such as VoiceGrad-DM-1 (VG-DM-1) and VoiceGrad-FM-1 (VG-FM-1), and achieves performance comparable to that of multi-step models, including VoiceGrad-DM-30 (VG-DM-30) and VoiceGrad-FM-30 (VG-FM-30), as well as one-step models enhanced by distillation and adversarial training, such as FastVoiceGrad (FVG) and FastVoiceGrad+ (FVG+).


IV. Versatility analysis (Table 3)
  • MVF: MeanVoiceFlow (proposed model)
Source Target   VG-DM-1  VG-DM-30  VG-FM-1  VG-FM-30  MVF
NFE             1        30        1        30        1
Female → Female
Male → Male
Female → Male
Male → Female

On another dataset, MeanVoiceFlow (MVF) outperforms one-step models trained from scratch, such as VoiceGrad-DM-1 (VG-DM-1) and VoiceGrad-FM-1 (VG-FM-1), and achieves performance comparable to that of multi-step models, including VoiceGrad-DM-30 (VG-DM-30) and VoiceGrad-FM-30 (VG-FM-30).


Citation

@inproceedings{kaneko2026meanvoiceflow,
  title={MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows},
  author={Kaneko, Takuhiro and Kameoka, Hirokazu and Tanaka, Kou and Kondo, Yuto},
  booktitle={ICASSP},
  year={2026},
}