StarGAN-VC2

Please check out our relevant work!

Previous work:

StarGAN-VC: StarGAN-VC

Other relevant work:

CycleGAN-VCs: CycleGAN-VC, CycleGAN-VC2, CycleGAN-VC3, MaskCycleGAN-VC (Latest)
Other VCs: Links to demo pages

Paper

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
Interspeech 2019 (arXiv:1907.12279, July 2019)
[Paper] [arXiv] [Poster] [BibTeX]

StarGAN-VC2

To advance research on multi-domain non-parallel VC, we rethink the conditional methods in StarGAN-VC [1] and propose an improved variant called StarGAN-VC2. Specifically, we rethink them in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source-domain data to be converted to target-domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic features in a domain-specific manner.
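As a rough illustration of what modulating a feature in a domain-specific manner can look like, the sketch below normalizes one feature channel over time and then applies a scale and shift selected by the (source, target) speaker pair. It assumes a conditional-instance-normalization-style operation. The function names, the `PARAMS` table, and its values are hypothetical and are not taken from this page; in a trained model such parameters would be learned rather than set by hand.

```python
from statistics import fmean, pstdev

# Hypothetical (scale, shift) parameters, one entry per (source, target) pair.
# In a real model these would be learned; the values here are placeholders.
PARAMS = {
    ("VCC2SM1", "VCC2SF1"): (2.0, 0.5),
    ("VCC2SF1", "VCC2SM1"): (0.5, -0.5),
}


def modulate(feature, gamma, beta, eps=1e-8):
    """Normalize one channel over time, then apply a scale and a shift."""
    mu = fmean(feature)
    sigma = pstdev(feature)
    return [gamma * (v - mu) / (sigma + eps) + beta for v in feature]


def convert_channel(feature, source, target):
    """Pick the modulation parameters from the source and target speakers."""
    gamma, beta = PARAMS[(source, target)]
    return modulate(feature, gamma, beta)


# Toy one-channel feature sequence (three frames).
out = convert_channel([1.0, 2.0, 3.0], "VCC2SM1", "VCC2SF1")
```

After modulation, the channel has a mean close to the selected shift and a standard deviation close to the selected scale, regardless of the statistics of the input.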

Converted speech samples

Task and dataset

We evaluated our method on the non-parallel multi-speaker VC task.
We used the Voice Conversion Challenge 2018 (VCC 2018) dataset [2], from which we selected a subset of speakers that covers all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2.
Each speaker has 81 sentences (about 5 minutes) for training, which is relatively little for VC.
Our goal is to learn 4 × 3 = 12 different source-and-target mappings using only a single model.
Note that we did not use any extra data, module, or time alignment procedures for training.
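The count of 12 mappings follows from pairing each of the 4 speakers with each of the 3 others. The short sketch below enumerates the ordered pairs and splits them into intra- and inter-gender conversions. The `GENDER` table is an illustrative helper based on the speaker labels used on this page, not part of the dataset.

```python
from itertools import permutations

SPEAKERS = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]

# Illustrative helper: gender of each speaker as labeled on this page.
GENDER = {"VCC2SF1": "F", "VCC2SF2": "F", "VCC2SM1": "M", "VCC2SM2": "M"}

# Ordered (source, target) pairs with source != target.
pairs = list(permutations(SPEAKERS, 2))

intra_gender = [(s, t) for s, t in pairs if GENDER[s] == GENDER[t]]
inter_gender = [(s, t) for s, t in pairs if GENDER[s] != GENDER[t]]

print(len(pairs), len(intra_gender), len(inter_gender))
```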

Results

We summarize the results in three ways:

1. Comparison between StarGAN-VC and StarGAN-VC2
2. One-to-multi VC using StarGAN-VC2
3. Voice morphing using StarGAN-VC2

1. Comparison between StarGAN-VC [1] and StarGAN-VC2

Notation

Source is the source speech samples.
Target is the target speech samples. They are provided as references. Note that we did not use these data during training.
StarGAN-VC is the converted speech samples, in which the conventional StarGAN-VC [1] was used to convert MCEPs.
StarGAN-VC2 is the converted speech samples, in which the proposed StarGAN-VC2 was used to convert MCEPs.
StarGAN-VC2++ is the converted speech samples, in which the proposed StarGAN-VC2 was used to convert all acoustic features (namely, MCEPs, band APs, continuous log F₀, and voiced/unvoiced indicator).
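This page does not describe how the continuous log F₀ and the voiced/unvoiced indicator are obtained. A common approach, assumed here purely for illustration, is to mark frames with F₀ > 0 as voiced and to linearly interpolate log F₀ across unvoiced frames. The function name and the example contour are hypothetical.

```python
import math


def continuous_log_f0(f0):
    """Return (interpolated log-F0 contour, voiced flags) for an F0 contour.

    Assumption: unvoiced frames are encoded as F0 == 0.
    """
    voiced = [v > 0 for v in f0]
    voiced_idx = [i for i, flag in enumerate(voiced) if flag]
    if not voiced_idx:
        raise ValueError("contour has no voiced frames")

    log_f0 = [math.log(v) if v > 0 else 0.0 for v in f0]
    out = list(log_f0)
    for i, flag in enumerate(voiced):
        if flag:
            continue
        prev = max((j for j in voiced_idx if j < i), default=None)
        nxt = min((j for j in voiced_idx if j > i), default=None)
        if prev is None:
            out[i] = log_f0[nxt]       # leading gap: copy first voiced value
        elif nxt is None:
            out[i] = log_f0[prev]      # trailing gap: copy last voiced value
        else:
            t = (i - prev) / (nxt - prev)
            out[i] = (1 - t) * log_f0[prev] + t * log_f0[nxt]
    return out, voiced


# Toy contour in Hz; zeros mark unvoiced frames.
contour, flags = continuous_log_f0([0.0, 100.0, 0.0, 0.0, 200.0, 0.0])
```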

Male (VCC2SM1) → Female (VCC2SF1)

	Source	Target	StarGAN-VC (Conventional)	StarGAN-VC2 (Proposed)	StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Female (VCC2SF2) → Male (VCC2SM2)

	Source	Target	StarGAN-VC (Conventional)	StarGAN-VC2 (Proposed)	StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Male (VCC2SM2) → Male (VCC2SM1)

	Source	Target	StarGAN-VC (Conventional)	StarGAN-VC2 (Proposed)	StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

Female (VCC2SF1) → Female (VCC2SF2)

	Source	Target	StarGAN-VC (Conventional)	StarGAN-VC2 (Proposed)	StarGAN-VC2++ (Proposed)
Sample 1
Sample 2
Sample 3

2. One-to-multi VC using StarGAN-VC2

Male (VCC2SM1) → Male (VCC2SM2), Female (VCC2SF1), or Female (VCC2SF2)

Real speech	VCC2SM1	VCC2SM2	VCC2SF1	VCC2SF2
Reference

Converted speech	VCC2SM1 (Source)	VCC2SM2 (Converted)	VCC2SF1 (Converted)	VCC2SF2 (Converted)
Sample 1
Sample 2
Sample 3

Female (VCC2SF1) → Female (VCC2SF2), Male (VCC2SM1), or Male (VCC2SM2)

Real speech	VCC2SF1	VCC2SF2	VCC2SM1	VCC2SM2
Reference

Converted speech	VCC2SF1 (Source)	VCC2SF2 (Converted)	VCC2SM1 (Converted)	VCC2SM2 (Converted)
Sample 1
Sample 2
Sample 3

3. Voice morphing using StarGAN-VC2

We morph voices between two speakers using StarGAN-VC2.
Our high-quality multi-domain VC framework makes it possible to conduct natural voice morphing.

	Source	Mixture rate 3:1	Mixture rate 1:1	Mixture rate 1:3	Target
VCC2SF1 → VCC2SF2
VCC2SF1 → VCC2SM1
VCC2SF1 → VCC2SM2
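The page does not state how the mixture rates are realized. One plausible reading, assumed here only for illustration, is that the speaker code given to the converter is a weighted blend of the one-hot codes of the two speakers, with 3:1 meaning three parts source speaker to one part target speaker. The helper names below are hypothetical.

```python
SPEAKERS = ["VCC2SF1", "VCC2SF2", "VCC2SM1", "VCC2SM2"]


def one_hot(speaker):
    """One-hot speaker code over the four speakers."""
    return [1.0 if s == speaker else 0.0 for s in SPEAKERS]


def mixed_code(source, target, source_part, target_part):
    """Blend two one-hot codes; source_part:target_part is the mixture rate."""
    w = target_part / (source_part + target_part)
    return [
        (1 - w) * a + w * b
        for a, b in zip(one_hot(source), one_hot(target))
    ]


# Codes for the three mixture rates listed in the table (assumed ordering).
codes = {
    rate: mixed_code("VCC2SF1", "VCC2SM1", *rate)
    for rate in [(3, 1), (1, 1), (1, 3)]
}
```

Under this assumption, the Source and Target columns correspond to the pure one-hot codes at the two ends of the blend.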

References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. StarGAN-VC: Non-parallel Many-to-Many Voice Conversion with Star Generative Adversarial Networks. arXiv:1806.02169, June 2018 (SLT, 2018). [Paper] [Project]

[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Speaker Odyssey, 2018. [Paper] [Dataset]

[3] K. Kobayashi and T. Toda. sprocket: Open-Source Voice Conversion Software. Speaker Odyssey, 2018. [Paper] [Project] [Samples (zip)]

[4] T. Kaneko and H. Kameoka. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. arXiv:1711.11293, Nov. 2017 (EUSIPCO, 2018). [Paper] [Project]

[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. ICASSP, 2019. [Paper] [Project]

[6] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo. ACVAE-VC: Non-parallel Many-to-Many Voice Conversion with Auxiliary Classifier Variational Autoencoder. arXiv:1808.05092, Aug. 2018 (IEEE/ACM Transactions on Audio, Speech, and Language Processing, May 2019). [Paper] [Project]

[7] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang. Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks. Interspeech, 2017. [Paper] [Project]
