Papers
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. 2018 IEEE Spoken Language Technology Workshop (SLT 2018), pp. 266-273, Dec. 2018.
  • Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," in Proc. 20th Annual Conference of the International Speech Communication Association (Interspeech 2019), pp. 679-683, Sep. 2019.
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "Nonparallel Voice Conversion With Augmented Classifier Star Generative Adversarial Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2982-2995, 2020.
StarGAN-VC

StarGAN-VC [1][2][3] is a nonparallel many-to-many voice conversion (VC) method based on star generative adversarial networks (StarGAN) [4]. The current version performs VC in two stages: it first modifies the mel-spectrogram of input speech from an arbitrary speaker according to a target speaker index, and then generates a waveform from the modified mel-spectrogram using a speaker-independent neural vocoder (HiFi-GAN [5] or Parallel WaveGAN [6]) [7]. There are several variants of the StarGAN formulation, depending on how the adversarial loss is defined [1][2][3]; these include the Wasserstein [8], cross-entropy [9], and least-squares [10] losses. Audio examples of these three versions of StarGAN-VC are given below.
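As a rough illustration of this two-stage pipeline, here is a minimal PyTorch sketch. The Generator below is a toy stand-in for the trained StarGAN-VC generator, and the vocoder is a stub standing in for HiFi-GAN or Parallel WaveGAN; all module names, dimensions, and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

N_MELS, N_SPEAKERS, EMB_DIM = 80, 4, 16  # assumed dimensions

class Generator(nn.Module):
    """Toy mel-to-mel converter conditioned on a target speaker index."""
    def __init__(self):
        super().__init__()
        self.spk_emb = nn.Embedding(N_SPEAKERS, EMB_DIM)
        self.net = nn.Sequential(
            nn.Conv1d(N_MELS + EMB_DIM, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, N_MELS, kernel_size=5, padding=2),
        )

    def forward(self, mel, target_spk):
        # mel: (batch, n_mels, frames); target_spk: (batch,) speaker indices
        emb = self.spk_emb(target_spk)                       # (batch, emb_dim)
        emb = emb.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        return self.net(torch.cat([mel, emb], dim=1))        # converted mel

generator = Generator()  # in practice, load trained StarGAN-VC weights here

# Stub standing in for a speaker-independent neural vocoder (HiFi-GAN / PWG),
# which would map the converted mel-spectrogram to a waveform.
vocoder = lambda mel: torch.zeros(mel.size(0), mel.size(-1) * 256)

src_mel = torch.randn(1, N_MELS, 200)  # mel-spectrogram of the input speech
target = torch.tensor([2])             # index of the target speaker
with torch.no_grad():
    conv_mel = generator(src_mel, target)  # stage 1: modify the mel-spectrogram
    waveform = vocoder(conv_mel)           # stage 2: vocode to a waveform
```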

Audio examples
Speaker identity conversion

Here are audio examples of speaker identity conversion on the CMU Arctic database [11], using 1132 phonetically balanced English utterances each from two female speakers ("clb" and "slt") and two male speakers ("bdl" and "rms"). The audio files for each speaker were divided into sets of 1000 and 132 files, and the first set was used to train StarGAN-VC, HiFi-GAN (HFG), and Parallel WaveGAN (PWG). WGAN, CGAN, and LSGAN denote the StarGAN formulations with the Wasserstein, cross-entropy, and least-squares losses, respectively.
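For reference, here is a minimal sketch of the three adversarial losses that distinguish these variants, written for raw (unnormalized) discriminator outputs. These are the generic objectives from [8], [9], and [10]; the actual StarGAN-VC losses also involve speaker conditioning and auxiliary terms not shown here.

```python
import torch
import torch.nn.functional as F

def wgan_d_loss(d_real, d_fake):
    # Wasserstein critic loss [8] (gradient penalty omitted for brevity).
    return d_fake.mean() - d_real.mean()

def wgan_g_loss(d_fake):
    return -d_fake.mean()

def cgan_d_loss(d_real, d_fake):
    # Cross-entropy (standard GAN) loss [9] on discriminator logits.
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def cgan_g_loss(d_fake):
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

def lsgan_d_loss(d_real, d_fake):
    # Least-squares loss [10]: push real outputs to 1 and fake outputs to 0.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    return ((d_fake - 1.0) ** 2).mean()
```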



[Audio table (sentence #b0408): each input speaker (clb, bdl, slt, rms) is converted to the other three speakers; for each pair, samples are given for the WGAN, CGAN, and LSGAN formulations, each vocoded with HFG and PWG.]



Electrolaryngeal speech enhancement

Here are some audio examples of electrolaryngeal-to-normal speech conversion. Electrolaryngeal (EL) speech is a method of voice restoration that uses an electrolarynx, a battery-operated device that produces the sound used to create a voice. The aim of this task is to improve the perceived naturalness of EL speech by converting it into a normal-sounding version. Since EL speech usually has a flat pitch contour, the challenge is to predict natural intonation from it. The utterances of two female speakers ("fkn" and "fks") and two male speakers ("mho" and "mht") reading the first 450 sentences of the ATR speech database were used as the training data for the target normal speech, and EL utterances of the same sentences were used as the training data for the source EL speech. Audio examples are provided below.
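One quick way to see the flat-pitch problem is to extract F0 from an EL utterance and a normal utterance and compare the spread of the two contours. A minimal sketch using librosa's pYIN tracker (the file names are hypothetical, and this is not the authors' tooling):

```python
import librosa
import numpy as np

def log_f0_std(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    f0 = f0[voiced_flag]        # keep voiced frames only
    return np.std(np.log(f0))   # spread of log-F0, a proxy for intonation range

# Hypothetical file names, for illustration only:
print("EL speech:    ", log_f0_std("el_J01.wav"))    # expected: close to 0
print("normal speech:", log_f0_std("fkn_J01.wav"))   # expected: clearly larger
```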


[Audio table (sentence #J01): input EL speech is converted to each of the four target speakers (fkn, fks, mho, mht); for each, samples are given for the WGAN, CGAN, and LSGAN formulations, each vocoded with HFG and PWG.]


Whisper-to-normal speech conversion

Here are some audio examples of whisper-to-normal speech conversion. Whispered speech can be useful for quiet, private online communication, where speakers do not want their voices to be heard by nearby listeners, but it suffers from lower intelligibility than normal speech. The aim of this task is to convert whispered speech into a normal version so that it becomes easier to hear. As in the EL-to-normal speech conversion task, the input speech contains no pitch information, so the challenge is again to predict natural intonation from unpitched speech. Audio examples are provided below.
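Since whispered speech is almost entirely unvoiced, a pitch tracker finds few or no voiced frames in it, which is another way to see why no pitch information is available. A quick check in the same spirit as the EL sketch above, again with hypothetical file names:

```python
import librosa
import numpy as np

def voiced_ratio(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    _, voiced_flag, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    return float(np.mean(voiced_flag))  # fraction of frames judged voiced

# Hypothetical file names, for illustration only:
print("whisper:", voiced_ratio("whisper_j01.wav"))  # expected: near 0.0
print("normal: ", voiced_ratio("normal_j01.wav"))   # expected: well above 0
```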


[Audio table (sentence #j01): input whispered speech is converted to normal speech; samples are given for the WGAN, CGAN, and LSGAN formulations, each vocoded with HFG and PWG.]

References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in Proc. SLT, 2018, pp. 266-273.

[2] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," in Proc. Interspeech, 2019, pp. 679-683.

[3] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "Nonparallel voice conversion with augmented classifier star generative adversarial networks," IEEE/ACM Trans. ASLP, vol. 28, pp. 2982-2995, 2020.

[4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proc. CVPR, 2018, pp. 8789-8797.

[5] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Adv. NeurIPS, 2020, pp. 17022-17033.

[6] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.

[7] https://github.com/kan-bayashi/ParallelWaveGAN

[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," in Adv. NeurIPS, 2017, pp. 5769-5779.

[9] I. Goodfellow et al., "Generative adversarial nets," in Adv. NeurIPS, 2014, pp. 2672-2680.

[10] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in Proc. ICCV, 2017, pp. 2794-2802.

[11] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223-224.