Papers
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432-1443, Sep. 2019.
ACVAE-VC

This page presents audio examples generated by ACVAE-VC [1] applied to mel-spectrograms (instead of mel-cepstral sequences). After mel-spectrogram conversion, Parallel WaveGAN [2][3] is used to generate waveforms from the converted mel-spectrograms.

Links to related pages

Please also refer to the following web sites for comparison.

Audio examples
Speaker identity conversion

Here are some audio examples of speaker identity conversion tested on the CMU Arctic database [4], which consists of 1132 phonetically balanced English utterances spoken by four US English speakers. The audio files for each speaker were manually divided into sets of 1000 and 132 files, and the 1000-file set was used to train both ACVAE and Parallel WaveGAN. For comparison, audio examples obtained with sprocket [5], an open-source parallel VC method, are also provided.
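The per-speaker 1000/132 split described above can be sketched as follows. This is only an illustration: the page says the split was done manually and does not say which files went into which set, so taking the first 1000 files in sorted order is an assumption, and the file names below are hypothetical placeholders, not the actual CMU Arctic file names.

```python
def split_utterances(file_names, n_train=1000):
    """Split one speaker's files into a training set and a held-out set.

    Assumption: the first `n_train` files in sorted order form the training
    set. The page only states that the split was made manually.
    """
    ordered = sorted(file_names)
    return ordered[:n_train], ordered[n_train:]


speakers = ["clb", "bdl", "slt", "rms"]

splits = {}
for spk in speakers:
    # Hypothetical file names: 1132 utterances per speaker.
    files = [f"{spk}/utt_{i:04d}.wav" for i in range(1, 1133)]
    splits[spk] = split_utterances(files)

for spk, (train_files, heldout_files) in splits.items():
    print(spk, len(train_files), len(heldout_files))
```

Each speaker ends up with 1000 training files and 132 held-out files, matching the set sizes stated above.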


[Audio table: ACVAE-VC (melspec). Rows: input speaker (clb, bdl, slt, rms). Columns: converted to (clb, bdl, slt, rms).]

[Audio table: sprocket [5]. Rows: input speaker (clb, bdl, slt, rms). Columns: converted to (clb, bdl, slt, rms).]



Electrolaryngeal speech enhancement

Here are some audio examples of electrolaryngeal-to-normal speech conversion. Loss of voice after a laryngectomy can considerably reduce quality of life. Electrolaryngeal (EL) speech is a method of voice restoration that uses an electrolarynx, a battery-operated device that produces sound to create a voice. Here, we used ACVAE-VC2 to perform EL speech enhancement with the aim of improving the perceived naturalness of EL speech. Audio examples obtained with ACVAE-VC2 and sprocket [5] are provided below. Because sprocket only adjusts the mean and variance of the log F0 contour of the input speech, and the pitch contour of EL speech is essentially flat, the speech samples it generates also have flat pitch contours.
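To see why adjusting only the mean and variance of log F0 cannot add pitch movement, consider a minimal sketch of such a linear log-F0 transformation. This is a generic illustration, not sprocket's actual code: the function name `convert_f0` and all the statistics below are hypothetical, and the source and target statistics are assumed to be computed beforehand over each speaker's training data.

```python
import math


def convert_f0(f0_hz, src_mean, src_std, tgt_mean, tgt_std):
    """Match the mean and variance of log F0 to the target speaker.

    Frames with f0 <= 0 are treated as unvoiced and passed through as 0.
    All four statistics are in the log-F0 domain.
    """
    converted = []
    for f0 in f0_hz:
        if f0 <= 0.0:
            converted.append(0.0)
            continue
        log_f0 = math.log(f0)
        new_log_f0 = (log_f0 - src_mean) / src_std * tgt_std + tgt_mean
        converted.append(math.exp(new_log_f0))
    return converted


# Hypothetical log-F0 statistics for the source (EL) and target (normal) speech.
src_mean, src_std = math.log(100.0), 0.05
tgt_mean, tgt_std = math.log(200.0), 0.20

# A perfectly flat input contour, as an idealization of EL speech.
flat_input = [100.0] * 5
flat_output = convert_f0(flat_input, src_mean, src_std, tgt_mean, tgt_std)
print(flat_output)
```

Because the transformation is the same affine map applied to every voiced frame, a constant input contour maps to a constant output contour: the pitch level shifts, but no intonation is created.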

[Audio table: ACVAE-VC (melspec). Input (EL) and converted (normal).]

[Audio table: sprocket [5]. Input (EL) and converted (normal).]
References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432-1443, Sep. 2019.

[2] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020, pp. 6199-6203.

[3] https://github.com/kan-bayashi/ParallelWaveGAN

[4] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223-224.

[5] K. Kobayashi and T. Toda, "sprocket: Open-source voice conversion software," in Proc. Odyssey, 2018, pp. 203-210.