ACVAE-VC

Papers
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, Sep. 2019.

ACVAE-VC [1] is a non-parallel, many-to-many voice conversion (VC) method based on an auxiliary classifier variational autoencoder (ACVAE). The current version performs VC in two stages: it first modifies the mel-spectrogram of the input speech, and then generates a waveform from the modified mel-spectrogram using a speaker-independent neural vocoder (HiFi-GAN [2] or Parallel WaveGAN [3]) [4]. The following are audio examples of the current version of ACVAE-VC with convolutional and recurrent architectures.
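Below is a minimal Python sketch of this two-stage pipeline. The mel analysis settings, the identity "converter," and the Griffin-Lim resynthesis are illustrative stand-ins (assumptions, not the authors' released models); in practice the converter is a trained ACVAE and the vocoder a trained HiFi-GAN or Parallel WaveGAN.

    # Minimal sketch of the two-stage pipeline: convert the input
    # mel-spectrogram, then synthesize a waveform with a vocoder.
    import librosa
    import numpy as np
    import soundfile as sf

    # Assumed analysis settings, typical of HiFi-GAN / Parallel WaveGAN recipes.
    SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

    wav, _ = librosa.load("input.wav", sr=SR)

    # Stage 1: mel analysis followed by conversion.
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)

    def convert_mel(mel: np.ndarray, target_speaker: str) -> np.ndarray:
        """Stand-in for the trained ACVAE converter, which would encode the
        mel features into a speaker-independent latent and decode them with
        the target speaker's code. Identity mapping here so the sketch runs."""
        return mel

    mel_conv = convert_mel(mel, target_speaker="slt")

    # Stage 2: waveform generation. A trained HiFi-GAN / Parallel WaveGAN
    # model would be used in practice; Griffin-Lim is a crude stand-in.
    wav_out = librosa.feature.inverse.mel_to_audio(
        mel_conv, sr=SR, n_fft=N_FFT, hop_length=HOP)
    sf.write("converted.wav", wav_out, SR)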

Audio examples
Speaker identity conversion

Here are audio examples of speaker identity conversion tested on the CMU Arctic database [5], which consists of 1132 phonetically balanced English utterances, each read by two female speakers ("clb" and "slt") and two male speakers ("bdl" and "rms"). The audio files for each speaker were manually divided into sets of 1000 and 132 files, and the first set was used as the training set for ACVAE-VC, HiFi-GAN (HFG), and Parallel WaveGAN (PWG). "CNN" and "RNN" stand for ACVAE-VC with fully convolutional and recurrent architectures, respectively.
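As a rough illustration, the 1000/132 split described above might be prepared as follows, assuming the standard CMU Arctic directory layout (cmu_us_<spk>_arctic/wav/); the corpus path and the first-1000 ordering are assumptions, since the paper's split was done manually.

    # Sketch of the 1000/132 train/test split, assuming the standard
    # CMU Arctic layout cmu_us_<spk>_arctic/wav/arctic_*.wav.
    from pathlib import Path

    ROOT = Path("cmu_arctic")  # hypothetical corpus location
    SPEAKERS = ["clb", "slt", "bdl", "rms"]

    splits = {}
    for spk in SPEAKERS:
        wavs = sorted((ROOT / f"cmu_us_{spk}_arctic" / "wav").glob("*.wav"))
        assert len(wavs) == 1132, f"expected 1132 utterances for {spk}"
        # First 1000 files train ACVAE-VC and the vocoders; 132 are held out.
        splits[spk] = {"train": wavs[:1000], "test": wavs[1000:]}

    print({spk: (len(s["train"]), len(s["test"])) for spk, s in splits.items()})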



[Audio example table omitted. For sentence #b0408, the page provides input and target recordings for every ordered pair of the speakers clb, bdl, slt, and rms, together with conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]



Electrolaryngeal speech enhancement

Here are some audio examples of electrolaryngeal-to-normal speech conversion. Electrolaryngeal (EL) speech is a method of voice restoration that uses an electrolarynx, a battery-operated device that produces sound to substitute for the voice. The aim of this task is to improve the perceived naturalness of EL speech by converting it into normal-sounding speech. Since EL speech usually has a flat pitch contour, the challenge is how to predict natural intonation from it; the sketch below illustrates this flatness. Utterances of two female speakers ("fkn" and "fks") and two male speakers ("mho" and "mht") reading the first 450 sentences of the ATR speech database were used as the training data for the target normal speech, and EL speech utterances of the same sentences were used as the training data for the source.
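To see the flat-pitch problem concretely, one can compare the F0 variability of EL and normal recordings, for example with the Harvest estimator from the pyworld package; the file names below are hypothetical.

    # Compare F0 variability of EL and normal speech with pyworld's
    # Harvest estimator; EL speech typically shows a much smaller
    # standard deviation (a flatter contour). File names are hypothetical.
    import librosa
    import numpy as np
    import pyworld

    def f0_stats(path, sr=16000):
        wav, _ = librosa.load(path, sr=sr)
        f0, _ = pyworld.harvest(wav.astype(np.float64), sr)
        voiced = f0[f0 > 0]  # Harvest marks unvoiced frames with 0 Hz
        return float(np.mean(voiced)), float(np.std(voiced))

    for name in ["el_speech.wav", "normal_speech.wav"]:
        mean_f0, std_f0 = f0_stats(name)
        print(f"{name}: mean F0 = {mean_f0:.1f} Hz, std = {std_f0:.1f} Hz")

Audio examples are provided below.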


[Audio example table omitted. For sentence #J01, the page provides the input EL speech, the target recordings of speakers fkn, fks, mho, and mht, and conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]


Whisper-to-normal speech conversion

Here are some audio examples of whisper-to-normal speech conversion. Whispered speech can be useful in quiet, private online communication, where the speaker does not want his or her voice to be heard by nearby listeners, but it suffers from lower intelligibility than normal speech. The aim of this task is to convert whispered speech into a normal version so that it becomes easier to hear. As in the electrolaryngeal speech enhancement task, the input speech carries no pitch information, so the challenge is again how to predict natural intonation from unpitched speech. Audio examples are provided below.


[Audio example table omitted. For sentence #j01, the page provides the input whispered speech, the target normal speech, and conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]

References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, Sep. 2019.

[2] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020, pp. 17022–17033.

[3] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.

[4] https://github.com/kan-bayashi/ParallelWaveGAN

[5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223–224.