StarGAN-VC2:

Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Takuhiro Kaneko    Hirokazu Kameoka    Kou Tanaka    Nobukatsu Hojo
NTT Communication Science Laboratories, NTT Corporation

Interspeech 2019
arXiv:1907.12279, July 2019

Paper

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
Interspeech 2019
[Paper] [Poster] [BibTeX]

NOTE: We previously proposed StarGAN-VC (ver. 1), CycleGAN-VC, and CycleGAN-VC2, which are closely related to this work. See the corresponding project pages for details.


StarGAN-VC2

To advance research on multi-domain non-parallel VC, we rethink the conditional methods in StarGAN-VC [1] and propose an improved variant called StarGAN-VC2. Specifically, we revisit them in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source domain data to be converted to the target domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic features in a domain-specific manner.
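
The modulation-based idea can be illustrated with a minimal sketch. It assumes the modulation takes the form of a conditional instance normalization: each feature channel is normalized over time, then rescaled and shifted with parameters selected by the domain. The function name, the parameter table, and all numeric values below are hypothetical, and keying the parameters by a (source, target) speaker pair is an assumption made for illustration, not a description of the released model.

```python
import math


def conditional_instance_norm(features, gamma, beta, eps=1e-5):
    """Normalize each channel over time, then apply a domain-specific
    scale (gamma) and shift (beta).

    features: list of channels, each a list of floats over time frames.
    gamma, beta: per-channel parameters chosen for one domain
                 (hypothetical values standing in for learned ones).
    """
    out = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (v - mean) / std + b for v in channel])
    return out


# Hypothetical modulation parameters, one set per (source, target) pair.
modulation = {
    ("VCC2SM1", "VCC2SF1"): {"gamma": [1.5, 0.5], "beta": [0.2, -0.1]},
    ("VCC2SF1", "VCC2SM1"): {"gamma": [0.8, 1.2], "beta": [-0.3, 0.0]},
}

# Toy feature map: 2 channels x 4 time frames.
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 0.0, 0.5]]
params = modulation[("VCC2SM1", "VCC2SF1")]
converted = conditional_instance_norm(features, params["gamma"], params["beta"])
```

After this operation each channel has a mean close to its `beta` and a standard deviation close to its `gamma`, so switching the domain key changes the statistics of the same input features.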

[Figure: network]

Converted speech samples

Task and dataset
  • We evaluated our method on the non-parallel multi-speaker VC task.
  • We used the Voice Conversion Challenge 2018 (VCC 2018) dataset [2], from which we selected a subset of speakers that covers all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2.
  • Each speaker has 81 sentences (about 5 minutes) for training, which is relatively little data for VC.
  • Our goal is to learn 4 x 3 = 12 different source-and-target mappings using only a single model.
  • Note that we did not use any extra data, module, or time alignment procedures for training.
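
The count of 4 x 3 = 12 mappings can be checked by enumerating every ordered pair of distinct speakers. The short sketch below does this and splits the pairs into intra- and inter-gender conversions. It assumes that the sixth character of each speaker ID (F or M) encodes the gender, which matches the four IDs listed above.

```python
from itertools import permutations

speakers = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]

# Every ordered (source, target) pair with source != target.
pairs = list(permutations(speakers, 2))


def gender(speaker_id):
    # Assumption: IDs look like "VCC2S" + gender letter + index.
    return speaker_id[5]


intra_gender = [(s, t) for s, t in pairs if gender(s) == gender(t)]
inter_gender = [(s, t) for s, t in pairs if gender(s) != gender(t)]

print(len(pairs), len(intra_gender), len(inter_gender))
```

Of the 12 mappings, 4 are intra-gender and 8 are inter-gender.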
Results
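We summarize the results in three ways: (1) a comparison between StarGAN-VC [1] and StarGAN-VC2, (2) one-to-multi VC using StarGAN-VC2, and (3) voice morphing using StarGAN-VC2.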



1. Comparison between StarGAN-VC [1] and StarGAN-VC2

Notation
  • Source is the source speech samples.
  • Target is the target speech samples. They are provided as references. Note that we did not use these data during training.
  • StarGAN-VC is the converted speech samples, in which the conventional StarGAN-VC [1] was used to convert the mel-cepstral coefficients (MCEPs).
  • StarGAN-VC2 is the converted speech samples, in which the proposed StarGAN-VC2 was used to convert MCEPs.
  • StarGAN-VC2++ is the converted speech samples, in which the proposed StarGAN-VC2 was used to convert all acoustic features (namely, MCEPs, band aperiodicities (APs), continuous log F0, and the voiced/unvoiced indicator).

Male (VCC2SM1) → Female (VCC2SF1)

Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Female (VCC2SF2) → Male (VCC2SM2)

Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Male (VCC2SM2) → Male (VCC2SM1)

Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Female (VCC2SF1) → Female (VCC2SF2)

Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

2. One-to-multi VC using StarGAN-VC2

Male (VCC2SM1) → Male (VCC2SM2), Female (VCC2SF1), or Female (VCC2SF2)

Real speech | VCC2SM1 | VCC2SM2 | VCC2SF1 | VCC2SF2
Reference
Converted speech | VCC2SM1 (Source) | VCC2SM2 (Converted) | VCC2SF1 (Converted) | VCC2SF2 (Converted)
Sample 1
Sample 2
Sample 3

Female (VCC2SF1) → Female (VCC2SF2), Male (VCC2SM1), or Male (VCC2SM2)

Real speech | VCC2SF1 | VCC2SF2 | VCC2SM1 | VCC2SM2
Reference
Converted speech | VCC2SF1 (Source) | VCC2SF2 (Converted) | VCC2SM1 (Converted) | VCC2SM2 (Converted)
Sample 1
Sample 2
Sample 3

3. Voice morphing using StarGAN-VC2

We morph voices between two speakers using StarGAN-VC2.
Our high-quality multi-domain VC framework makes it possible to conduct natural voice morphing.

Source | Mixture rate 3:1 | Mixture rate 1:1 | Mixture rate 1:3 | Target
VCC2SF1 → VCC2SF2
VCC2SF1 → VCC2SM1
VCC2SF1 → VCC2SM2
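
This page does not state how the mixtures are produced. One plausible reading, assumed here purely for illustration, is that a mixture rate a:b corresponds to a convex combination of the speaker codes of the two speakers with weights a/(a+b) and b/(a+b). The sketch below applies that assumption to one-hot codes; the function name and the one-hot ordering are hypothetical and are not taken from the authors' implementation.

```python
speakers = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]


def one_hot(speaker_id):
    # One-hot speaker code; the ordering follows the list above (an assumption).
    return [1.0 if s == speaker_id else 0.0 for s in speakers]


def mix_codes(code_a, code_b, ratio_a, ratio_b):
    """Blend two speaker codes for a mixture rate ratio_a:ratio_b."""
    total = ratio_a + ratio_b
    w_a, w_b = ratio_a / total, ratio_b / total
    return [w_a * a + w_b * b for a, b in zip(code_a, code_b)]


source_code = one_hot("VCC2SF1")
target_code = one_hot("VCC2SF2")

# The three mixture rates from the table: 3:1, 1:1, and 1:3.
morph_codes = {
    rate: mix_codes(source_code, target_code, *rate)
    for rate in [(3, 1), (1, 1), (1, 3)]
}
```

Under this assumption, the 3:1 code stays closest to the source speaker and the 1:3 code is closest to the target, with each blended code still summing to 1.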


References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. arXiv:1806.02169, June 2018 (SLT, 2018). [Paper] [Project]

[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Speaker Odyssey, 2018. [Paper] [Dataset]

[3] K. Kobayashi and T. Toda. sprocket: Open-Source Voice Conversion Software. Speaker Odyssey, 2018. [Paper] [Project] [Samples (zip)]

[4] T. Kaneko and H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018). [Paper] [Project]

[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019. [Paper] [Project]

[6] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: Non-parallel Many-to-Many Voice Conversion with Auxiliary Classifier Variational Autoencoder. arXiv:1808.05092, Aug. 2018. (IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019). [Paper] [Project]

[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks. Interspeech, 2017. [Paper] [Project]