Learning Non-parallel Voice Conversion with Filling in Frames

Takuhiro Kaneko    Hirokazu Kameoka    Kou Tanaka    Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation

arXiv:2102.12841, Feb. 2021


Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo
MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames
ICASSP 2021 (arXiv:2102.12841, Feb. 2021)


Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC [1] and CycleGAN-VC2 [2]) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and does not extend to mel-spectrogram conversion, despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3 [3], an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, this improvement comes at the cost of an increase in the number of learned parameters. As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in the missing frames based on the surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN. A subjective evaluation of naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model size similar to that of CycleGAN-VC2.

Figure 1. Pipeline of Filling in Frames (FIF). We encourage the converter to fill in the missing frames (surrounded by the red box) based on the surrounding frames through a cyclic conversion process.
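
The masking step of FIF can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `apply_temporal_mask`, the parameter `max_mask_frames`, the uniform sampling of the gap length and position, and the stacking of the mask as a second input channel are all assumptions made for illustration. The converter and the cyclic training losses are omitted.

```python
import numpy as np


def apply_temporal_mask(mel, max_mask_frames, rng):
    """Zero out one contiguous run of frames in a mel-spectrogram.

    mel: array of shape (n_mels, n_frames).
    max_mask_frames: hypothetical upper bound on the gap length.
    Returns (masked_mel, mask); mask is 1 for kept frames and 0 for missing ones.
    """
    n_mels, n_frames = mel.shape
    # Gap length and start position are drawn uniformly (an assumption).
    gap = int(rng.integers(0, min(max_mask_frames, n_frames) + 1))
    start = int(rng.integers(0, n_frames - gap + 1))
    mask = np.ones_like(mel)
    mask[:, start:start + gap] = 0.0  # the same frames are hidden in every mel band
    return mel * mask, mask


rng = np.random.default_rng(0)
# Dummy input: 80 mel bands and 64 frames are illustrative sizes only.
mel = rng.standard_normal((80, 64)).astype(np.float32)
masked_mel, mask = apply_temporal_mask(mel, max_mask_frames=32, rng=rng)

# Assumed conditioning: stack the mask as an extra channel so that a converter
# could tell which frames are missing.
converter_input = np.stack([masked_mel, mask], axis=0)  # shape (2, 80, 64)
```

Because the mask spans all mel bands over a range of frames, the hidden region corresponds to the red box in Figure 1; the converter would have to infer it from the frames on either side.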

Conversion samples

Recommended browsers: Safari, Chrome, Firefox, and Opera.

Experimental conditions

  • We evaluated our method on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [4].
  • For each speaker, 81 sentences (approximately 5 min in length, which is relatively short for VC) were used for training.
  • The training set contains no overlapping utterances between the source and target speakers; therefore, we need to learn a converter in a fully non-parallel setting.
  • We used MelGAN [5] as a vocoder.

Compared models

  • V2: CycleGAN-VC2 [2]
  • V3: CycleGAN-VC3 [3]
  • Mask: MaskCycleGAN-VC (proposed)


Female (VCC2SF3) → Male (VCC2TM1)

Source Target V2 V3 Mask
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Female (VCC2TF1)

Source Target V2 V3 Mask
Sample 1
Sample 2
Sample 3

Female (VCC2SF3) → Female (VCC2TF1)

Source Target V2 V3 Mask
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Male (VCC2TM1)

Source Target V2 V3 Mask
Sample 1
Sample 2
Sample 3


[1] T. Kaneko, H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018).
[2] T. Kaneko, H. Kameoka, K. Tanaka, N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019.
[3] T. Kaneko, H. Kameoka, K. Tanaka, N. Hojo. CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion. Interspeech, 2020.
[4] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Odyssey, 2018.
[5] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. NeurIPS, 2019.