Please check out our relevant work!
Previous work:
- StarGAN-VC: StarGAN-VC
- CycleGAN-VCs: CycleGAN-VC, CycleGAN-VC2, CycleGAN-VC3, MaskCycleGAN-VC (Latest)
- Other VCs: Links to demo pages
StarGAN-VC2
To advance research on multi-domain non-parallel VC, we rethink the conditional methods used in StarGAN-VC [1] and propose an improved variant called StarGAN-VC2. Specifically, we reconsider conditioning in two respects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that encourages all source-domain data to be convertible into target-domain data. For the latter, we introduce a modulation-based conditional method that transforms the modulation of the acoustic features in a domain-specific manner.
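To make the two ingredients concrete: paraphrasing the paper's notation (with G the generator, D the discriminator, c the source code, and c' the target code), the source-and-target conditional adversarial loss takes roughly the form

$$
\mathcal{L}_{\text{st-adv}} = \mathbb{E}_{(x,c)\sim P(x,c),\,c'\sim P(c')}\big[\log D(x, c', c)\big] + \mathbb{E}_{(x,c)\sim P(x,c),\,c'\sim P(c')}\big[\log\big(1 - D(G(x, c, c'), c, c')\big)\big],
$$

so real speech of domain c and converted speech G(x, c, c') are both judged jointly with a (source, target) code pair. On the architectural side, the sketch below shows one plausible form of a modulation-based conditional layer: conditional instance normalization whose scale and shift are predicted from a joint source-target embedding. This is an illustrative reconstruction under assumed names and sizes, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SourceTargetCIN(nn.Module):
    """Sketch of a modulation-based conditional layer: instance normalization
    whose per-channel scale (gamma) and shift (beta) are predicted from a
    joint (source, target) domain embedding. Illustrative only."""

    def __init__(self, channels: int, num_domains: int, emb_dim: int = 16):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.src_emb = nn.Embedding(num_domains, emb_dim)
        self.tgt_emb = nn.Embedding(num_domains, emb_dim)
        self.affine = nn.Linear(2 * emb_dim, 2 * channels)

    def forward(self, x, src, tgt):
        # x: (batch, channels, frames) acoustic features; src, tgt: (batch,) domain ids
        h = self.norm(x)
        cond = torch.cat([self.src_emb(src), self.tgt_emb(tgt)], dim=1)
        gamma, beta = self.affine(cond).chunk(2, dim=1)
        # Domain-pair-specific modulation; (1 + gamma) keeps the layer near
        # the identity at initialization.
        return (1 + gamma).unsqueeze(-1) * h + beta.unsqueeze(-1)

# Usage: modulate 36-dimensional MCEP sequences in a 4-speaker setup
layer = SourceTargetCIN(channels=36, num_domains=4)
y = layer(torch.randn(8, 36, 128), torch.randint(0, 4, (8,)), torch.randint(0, 4, (8,)))
```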

Converted speech samples
Task and dataset
- We evaluated our method on the non-parallel multi-speaker VC task.
- We used the Voice Conversion Challenge 2018 (VCC 2018) dataset [2], from which we selected a subset of speakers covering all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2.
- Each speaker has 81 sentences (about 5 minutes of speech) for training, which is relatively little data for VC.
- Our goal is to learn all 4 x 3 = 12 source-and-target mappings using only a single model (the pairs are enumerated in the sketch after this list).
- Note that we did not use any extra data, modules, or time-alignment procedures for training.
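As a concrete check of the task size, the following hypothetical snippet enumerates every ordered speaker pair that the single model must cover:

```python
from itertools import permutations

SPEAKERS = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]

# Every ordered (source, target) pair with source != target: 4 x 3 = 12 mappings,
# all handled by one StarGAN-VC2 generator.
mappings = list(permutations(SPEAKERS, 2))
assert len(mappings) == 12
for src, tgt in mappings:
    print(f"{src} -> {tgt}")
```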
Results
We summarize the results in three ways:
1. Comparison between StarGAN-VC and StarGAN-VC2
2. One-to-multi VC using StarGAN-VC2
3. Voice morphing using StarGAN-VC2
NOTE: Recommended browsers are Apple Safari, Google Chrome, or Mozilla Firefox.
1. Comparison between StarGAN-VC [1] and StarGAN-VC2
Notation
- Source: the source speech samples.
- Target: the target speech samples, provided as references. Note that we did not use these data during training.
- StarGAN-VC: converted speech samples in which the conventional StarGAN-VC [1] was used to convert MCEPs.
- StarGAN-VC2: converted speech samples in which the proposed StarGAN-VC2 was used to convert MCEPs.
- StarGAN-VC2++: converted speech samples in which the proposed StarGAN-VC2 was used to convert all acoustic features (namely, MCEPs, band aperiodicities, continuous log F0, and the voiced/unvoiced indicator); a sketch of how these features are typically extracted follows this list.
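As background for the StarGAN-VC2++ setting, the sketch below shows how these four feature streams are commonly extracted with the WORLD vocoder via the pyworld and pysptk Python bindings. The analysis settings (MCEP order, all-pass constant alpha, input file name) are assumptions for illustration, not the authors' configuration.

```python
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

x, fs = sf.read("speech.wav")                 # mono waveform; pyworld expects float64
x = np.ascontiguousarray(x, dtype=np.float64)

# WORLD analysis
f0, t = pw.harvest(x, fs)                     # frame-wise F0 (0 at unvoiced frames)
sp = pw.cheaptrick(x, f0, t, fs)              # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity

# The four feature streams named above (order/alpha values are assumptions)
mcep = pysptk.sp2mc(sp, order=34, alpha=0.466)  # mel-cepstral coefficients (MCEPs)
bap = pw.code_aperiodicity(ap, fs)              # band aperiodicities
vuv = (f0 > 0).astype(np.float64)               # voiced/unvoiced indicator

# Continuous log F0: take log at voiced frames, interpolate across unvoiced ones
voiced = np.nonzero(f0 > 0)[0]
cont_logf0 = np.interp(np.arange(len(f0)), voiced, np.log(f0[voiced]))
```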
Male (VCC2SM1) → Female (VCC2SF1)
|  | Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed) |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Female (VCC2SF2) → Male (VCC2SM2)
|  | Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed) |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Male (VCC2SM2) → Male (VCC2SM1)
|  | Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed) |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) |
Female (VCC2SF1) → Female (VCC2SF2)
|  | Source | Target | StarGAN-VC (Conventional) | StarGAN-VC2 (Proposed) | StarGAN-VC2++ (Proposed) |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) | (audio) |
2. One-to-multi VC using StarGAN-VC2
Male (VCC2SM1) → Male (VCC2SM2), Female (VCC2SF1), or Female (VCC2SF2)
| Real speech | VCC2SM1 | VCC2SM2 | VCC2SF1 | VCC2SF2 |
| --- | --- | --- | --- | --- |
| Reference | (audio) | (audio) | (audio) | (audio) |

| Converted speech | VCC2SM1 (Source) | VCC2SM2 (Converted) | VCC2SF1 (Converted) | VCC2SF2 (Converted) |
| --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) |
Female (VCC2SF1) → Female (VCC2SF2), Male (VCC2SM1), or Male (VCC2SM2)
| Real speech | VCC2SF1 | VCC2SF2 | VCC2SM1 | VCC2SM2 |
| --- | --- | --- | --- | --- |
| Reference | (audio) | (audio) | (audio) | (audio) |

| Converted speech | VCC2SF1 (Source) | VCC2SF2 (Converted) | VCC2SM1 (Converted) | VCC2SM2 (Converted) |
| --- | --- | --- | --- | --- |
| Sample 1 | (audio) | (audio) | (audio) | (audio) |
| Sample 2 | (audio) | (audio) | (audio) | (audio) |
| Sample 3 | (audio) | (audio) | (audio) | (audio) |
3. Voice morphing using StarGAN-VC2
We morph voices between two speakers using StarGAN-VC2.
Our high-quality multi-domain VC framework enables natural voice morphing; a sketch of one way to realize the mixture rates follows the table below.
|  | Source | Mixture rate 3:1 | Mixture rate 1:1 | Mixture rate 1:3 | Target |
| --- | --- | --- | --- | --- | --- |
| VCC2SF1 → VCC2SF2 | (audio) | (audio) | (audio) | (audio) | (audio) |
| VCC2SF1 → VCC2SM1 | (audio) | (audio) | (audio) | (audio) | (audio) |
| VCC2SF1 → VCC2SM2 | (audio) | (audio) | (audio) | (audio) | (audio) |
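One plausible way to realize the mixture rates above with a conditional generator is to linearly interpolate the two speakers' domain embeddings before conditioning. The helper below is hypothetical; this page does not specify the exact morphing mechanism.

```python
import torch

def morph_code(emb_a: torch.Tensor, emb_b: torch.Tensor, ratio_a: float) -> torch.Tensor:
    """Linearly interpolate two speaker (domain) embeddings.
    A mixture rate of a:b corresponds to ratio_a = a / (a + b):
    3:1 -> 0.75, 1:1 -> 0.5, 1:3 -> 0.25. Hypothetical helper."""
    return ratio_a * emb_a + (1.0 - ratio_a) * emb_b

# Example: a 3:1 mixture of VCC2SF1 and VCC2SF2 (stand-in random embeddings)
e_sf1, e_sf2 = torch.randn(16), torch.randn(16)
mixed = morph_code(e_sf1, e_sf2, ratio_a=0.75)
```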
References
[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. arXiv:1806.02169, June 2018 (SLT, 2018). [Paper] [Project]
[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Speaker Odyssey, 2018. [Paper] [Dataset]
[3] K. Kobayashi and T. Toda. sprocket: Open-Source Voice Conversion Software. Speaker Odyssey, 2018. [Paper] [Project] [Samples (zip)]
[4] T. Kaneko and H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018). [Paper] [Project]
[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019. [Paper] [Project]
[6] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: Non-parallel Many-to-Many Voice Conversion with Auxiliary Classifier Variational Autoencoder. arXiv:1808.05092, Aug. 2018 (IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019). [Paper] [Project]
[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks. Interspeech, 2017. [Paper] [Project]