CycleGAN-VC3:

Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

Takuhiro Kaneko    Hirokazu Kameoka    Kou Tanaka    Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation

Interspeech 2020
arXiv:2010.11672, Oct. 2020

Paper

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo
CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion
Interspeech 2020
(arXiv:2010.11672, Oct. 2020)
[Paper] [Slides] [BibTeX]

Our relevant previous work:
CycleGAN-VC [1], CycleGAN-VC2 [2], StarGAN-VC [3], StarGAN-VC2 [4], and Other demos


CycleGAN-VC3

Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speech without using a parallel corpus. Recently, CycleGAN-VC [1] and CycleGAN-VC2 [2] have shown promising results on this problem and have been widely used as benchmark methods. However, because the effectiveness of CycleGAN-VC/VC2 for mel-spectrogram conversion is unclear, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to mel-spectrogram conversion. Through initial experiments, we discovered that directly applying them compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates time-frequency adaptive normalization (TFAN). Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram. We evaluated CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC. A subjective evaluation of naturalness and similarity showed that for every VC pair, CycleGAN-VC3 outperforms or is competitive with the two types of CycleGAN-VC2, one of which was applied to mel-cepstrum and the other to mel-spectrogram.
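What makes non-parallel training possible in CycleGAN-VC and its successors is the cycle-consistency loss: a source feature mapped to the target domain and back should reconstruct the original, and vice versa. A minimal illustration of this loss (the stand-in "generators" below are hypothetical invertible functions, not the paper's networks):

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle-consistency loss used by CycleGAN-style models:
    x -> g_xy(x) -> g_yx(g_xy(x)) should return to x, and likewise for y."""
    forward = np.abs(g_yx(g_xy(x)) - x).mean()
    backward = np.abs(g_xy(g_yx(y)) - y).mean()
    return forward + backward

# Stand-in generators: an affine map and its exact inverse (hypothetical).
g_xy = lambda a: 2.0 * a + 1.0
g_yx = lambda a: (a - 1.0) / 2.0

x = np.linspace(-1.0, 1.0, 8)
y = g_xy(x)
loss = cycle_consistency_loss(x, y, g_xy, g_yx)
print(loss)  # ~0, since the mappings invert each other
```

In the actual models, this loss is combined with adversarial and identity-mapping losses, and the generators are learned networks rather than fixed functions.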

Figure 1. We developed time-frequency adaptive normalization (TFAN), which extends instance normalization [5] so that the affine parameters become element-dependent and are determined according to an entire input mel-spectrogram.
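The idea in Figure 1 can be sketched in a few lines: where instance normalization applies one learned scale and bias per channel, TFAN computes a full (channel, frequency, time) array of scales and biases from the source mel-spectrogram. The sketch below uses a per-channel linear map of the source mel-spectrogram in place of the paper's small convolutional layers; the parameters `w_gamma`, `b_gamma`, `w_beta`, and `b_beta` are hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization [5]: normalize each channel of x (C, F, T)
    over its own time-frequency plane."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def tfan(x, source_mel, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Time-frequency adaptive normalization (sketch).

    Unlike instance norm, whose scale and bias are per-channel scalars,
    TFAN makes gamma and beta element-dependent (one value per
    time-frequency bin), computed from the source mel-spectrogram so the
    normalization reflects its time-frequency structure.
    """
    x_norm = instance_norm(x, eps)
    # Hypothetical per-channel linear maps standing in for the paper's
    # conv layers that produce gamma and beta from the source mel.
    gamma = w_gamma[:, None, None] * source_mel[None] + b_gamma[:, None, None]
    beta = w_beta[:, None, None] * source_mel[None] + b_beta[:, None, None]
    return gamma * x_norm + beta

# Toy shapes: 2 channels, 80 mel bins, 10 frames.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 80, 10))
mel = rng.normal(size=(80, 10))
out = tfan(x, mel, np.ones(2), np.zeros(2), np.zeros(2), np.zeros(2))
print(out.shape)  # (2, 80, 10): same shape as the input features
```

The key property to notice is that `gamma` and `beta` vary over both frequency and time, so information about the source's time-frequency structure is reinjected after normalization rather than being washed out.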

Conversion samples

Recommended browsers: Safari, Chrome, Firefox, and Opera.

Experimental conditions

  • We evaluated our method on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [6].
  • For each speaker, 81 utterances (approximately 5 minutes of speech, which is relatively little data for VC) were used for training.
  • In the training set, there is no overlap between source and target utterances; therefore, this problem must be solved in a fully non-parallel setting.
  • We did not use any extra data, modules, or time-alignment procedures for training.

Compared models

Results

Female (VCC2SF3) → Female (VCC2TF1)

Source Target B V1 V2 V3
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Male (VCC2TM1)

Source Target B V1 V2 V3
Sample 1
Sample 2
Sample 3

Female (VCC2SF3) → Male (VCC2TM1)

Source Target B V1 V2 V3
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Female (VCC2TF1)

Source Target B V1 V2 V3
Sample 1
Sample 2
Sample 3

References

[1] T. Kaneko, H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018). [Paper] [Project]
[2] T. Kaneko, H. Kameoka, K. Tanaka, N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019. [Paper] [Project]
[3] H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. SLT, 2018. [Paper] [Project]
[4] T. Kaneko, H. Kameoka, K. Tanaka, N. Hojo. StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion. Interspeech, 2019. [Paper] [Project]
[5] D. Ulyanov, A. Vedaldi, V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022, 2016. [Paper]
[6] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Odyssey, 2018. [Paper] [Dataset]
[7] M. Morise, F. Yokomori, K. Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Trans. Inf. Syst., 2016. [Paper] [Project]
[8] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. NeurIPS, 2019. [Paper] [Project]