ACVAE-VC: Auxiliary Classifier Variational Autoencoder Mel-Spectrogram Conversion

Papers

Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432-1443, Sep. 2019.

ACVAE-VC

Here we demonstrate audio examples generated by ACVAE-VC applied to mel-spectrograms (instead of mel-cepstral sequences). After mel-spectrogram conversion, Parallel WaveGAN [2][3] is used to generate waveforms from the converted mel-spectrograms.

Links to related pages

Please also refer to the following web sites for comparison.

Audio examples

Speaker identity conversion

Here are some audio examples of speaker identity conversion tested on the CMU Arctic database [4], which consists of 1132 phonetically balanced English utterances spoken by four US English speakers. The audio files for each speaker were manually divided into two sets of 1000 and 132 files, and the first set was used to train both ACVAE and Parallel WaveGAN. For comparison, audio examples obtained with sprocket [5], an open-source parallel VC method, are also provided.
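
The per-speaker split described above can be sketched as follows. This is only an illustration: the split on this page was made manually, so the rule used here (take the first 1000 files in sorted order) and the file names are assumptions, not the authors' procedure.

```python
def split_utterances(files, n_train=1000):
    """Split one speaker's files into (training, held-out) lists.

    Assumption: the first n_train files in sorted order form the training set.
    """
    ordered = sorted(files)
    return ordered[:n_train], ordered[n_train:]


speakers = ["clb", "bdl", "slt", "rms"]

# Hypothetical file names: 1132 utterances per speaker.
files_by_speaker = {
    spk: [f"{spk}/utt_{i:04d}.wav" for i in range(1, 1133)] for spk in speakers
}

splits = {spk: split_utterances(files) for spk, files in files_by_speaker.items()}

for spk, (train, held_out) in splits.items():
    print(spk, len(train), len(held_out))
```

With 1132 files per speaker, each speaker ends up with 1000 training files and 132 held-out files, matching the set sizes stated above.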

ACVAE-VC (melspec)
Input \ Converted to	clb	bdl	slt	rms
clb	—
bdl		—
slt			—
rms				—

sprocket [5]
Input \ Converted to	clb	bdl	slt	rms
clb	—
bdl		—
slt			—
rms				—

Electrolaryngeal speech enhancement

Here are some audio examples of electrolaryngeal-to-normal speech conversion. Loss of voice after a laryngectomy can considerably reduce quality of life. Electrolaryngeal (EL) speech is a means of voice restoration that uses an electrolarynx, a battery-operated device that generates the sound source for the voice. Here, we used ACVAE-VC to perform EL speech enhancement with the aim of improving the perceived naturalness of EL speech. Audio examples obtained with ACVAE-VC and sprocket [5] are provided below. Because sprocket is designed to adjust only the mean and variance of the log F0 contour of the input speech, and EL speech has a nearly flat pitch contour, the speech samples generated by sprocket also have flat pitch contours.
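
To see why adjusting only the mean and variance of log F0 cannot add intonation, consider a minimal sketch of such a transform. It is an illustration of the general technique, not sprocket's actual code; the function name and all statistics below are hypothetical. Each voiced frame is standardized with the source statistics and rescaled with the target statistics, which is an affine map in the log domain, so a constant input contour yields a constant output contour.

```python
import math


def convert_f0(f0_hz, src_mean, src_std, tgt_mean, tgt_std):
    """Mean-variance transform of log F0.

    Frames with F0 <= 0 are treated as unvoiced and returned as 0.
    """
    converted = []
    for f0 in f0_hz:
        if f0 <= 0.0:
            converted.append(0.0)
        else:
            z = (math.log(f0) - src_mean) / src_std  # standardize with source stats
            converted.append(math.exp(z * tgt_std + tgt_mean))  # rescale to target
    return converted


# Hypothetical log-F0 statistics for the source and target speakers.
src_mean, src_std = math.log(100.0), 0.05
tgt_mean, tgt_std = math.log(200.0), 0.25

# A perfectly flat 100 Hz contour with two unvoiced frames.
flat_input = [0.0, 100.0, 100.0, 100.0, 0.0, 100.0]
converted = convert_f0(flat_input, src_mean, src_std, tgt_mean, tgt_std)

voiced = [f0 for f0 in converted if f0 > 0.0]
print(converted)
print("F0 range over voiced frames:", max(voiced) - min(voiced))
```

The transform moves the contour to the target speaker's pitch level, but every voiced frame receives the same value, so the output is as flat as the input.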

ACVAE-VC (melspec)
Input (EL)	Converted (normal)

sprocket [5]
Input (EL)	Converted (normal)

References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432-1443, Sep. 2019.

[2] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.

[3] https://github.com/kan-bayashi/ParallelWaveGAN

[4] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223-224.

[5] K. Kobayashi and T. Toda, "sprocket: Open-source voice conversion software," in Proc. Odyssey, 2018, pp. 203-210.

ACVAE-VC

Auxiliary Classifier Variational Autoencoder
Mel-Spectrogram Conversion

Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

NTT Communication Science Laboratories, NTT Corporation
