ACVAE-VC

Papers
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, Sep. 2019.

ACVAE-VC [1] is a non-parallel, many-to-many voice conversion (VC) method based on an auxiliary classifier variational autoencoder (ACVAE). The current version performs VC in two stages: it first modifies the mel-spectrogram of the input speech, and then generates a waveform from the modified mel-spectrogram using a speaker-independent neural vocoder (HiFi-GAN [2] or Parallel WaveGAN [3]) [4]. The following are audio examples of the current version of ACVAE-VC with convolutional and recurrent architectures.
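Below is a minimal Python sketch of this two-stage pipeline. The mel analysis settings, the identity "converter," and the Griffin-Lim resynthesis are illustrative stand-ins (assumptions, not the authors' released models); in practice the converter is a trained ACVAE and the vocoder a trained HiFi-GAN or Parallel WaveGAN.

    # Minimal sketch of the two-stage pipeline: convert the input
    # mel-spectrogram, then synthesize a waveform with a vocoder.
    import librosa
    import numpy as np
    import soundfile as sf

    # Assumed analysis settings, typical of HiFi-GAN / Parallel WaveGAN recipes.
    SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

    wav, _ = librosa.load("input.wav", sr=SR)

    # Stage 1: mel analysis followed by conversion.
    mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)

    def convert_mel(mel: np.ndarray, target_speaker: str) -> np.ndarray:
        """Stand-in for the trained ACVAE converter, which would encode the
        mel features into a speaker-independent latent and decode them with
        the target speaker's code. Identity mapping here so the sketch runs."""
        return mel

    mel_conv = convert_mel(mel, target_speaker="slt")

    # Stage 2: waveform generation. A trained HiFi-GAN / Parallel WaveGAN
    # model would be used in practice; Griffin-Lim is a crude stand-in.
    wav_out = librosa.feature.inverse.mel_to_audio(
        mel_conv, sr=SR, n_fft=N_FFT, hop_length=HOP)
    sf.write("converted.wav", wav_out, SR)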

Audio examples
Speaker identity conversion

Here are audio examples of speaker identity conversion tested on the CMU Arctic database [5], which consists of 1132 phonetically balanced English utterances, each read by two female speakers ("clb" and "slt") and two male speakers ("bdl" and "rms"). The audio files for each speaker were manually divided into sets of 1000 and 132 files, and the first set was used as the training set for ACVAE-VC, HiFi-GAN (HFG), and Parallel WaveGAN (PWG). "CNN" and "RNN" stand for ACVAE-VC with fully convolutional and recurrent architectures, respectively.
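As a rough illustration, the 1000/132 split described above might be prepared as follows, assuming the standard CMU Arctic directory layout (cmu_us_<spk>_arctic/wav/); the corpus path and the first-1000 ordering are assumptions, since the paper's split was done manually.

    # Sketch of the 1000/132 train/test split, assuming the standard
    # CMU Arctic layout cmu_us_<spk>_arctic/wav/arctic_*.wav.
    from pathlib import Path

    ROOT = Path("cmu_arctic")  # hypothetical corpus location
    SPEAKERS = ["clb", "slt", "bdl", "rms"]

    splits = {}
    for spk in SPEAKERS:
        wavs = sorted((ROOT / f"cmu_us_{spk}_arctic" / "wav").glob("*.wav"))
        assert len(wavs) == 1132, f"expected 1132 utterances for {spk}"
        # First 1000 files train ACVAE-VC and the vocoders; 132 are held out.
        splits[spk] = {"train": wavs[:1000], "test": wavs[1000:]}

    print({spk: (len(s["train"]), len(s["test"])) for spk, s in splits.items()})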



[Audio example table omitted. For sentence #b0408, the page provides input and target recordings for every ordered pair of the speakers clb, bdl, slt, and rms, together with conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]



Electrolaryngeal speech enhancement

Here are some audio examples of electrolaryngeal-to-normal speech conversion. Electrolaryngeal (EL) speech is a method of voice restoration that uses an electrolarynx, a battery-operated device that produces sound to substitute for the voice. The aim of this task is to improve the perceived naturalness of EL speech by converting it into normal-sounding speech. Since EL speech usually has a flat pitch contour, the challenge is how to predict natural intonation from it; the sketch below illustrates this flatness. Utterances of two female speakers ("fkn" and "fks") and two male speakers ("mho" and "mht") reading the first 450 sentences of the ATR speech database were used as the training data for the target normal speech, and EL speech utterances of the same sentences were used as the training data for the source.
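To see the flat-pitch problem concretely, one can compare the F0 variability of EL and normal recordings, for example with the Harvest estimator from the pyworld package; the file names below are hypothetical.

    # Compare F0 variability of EL and normal speech with pyworld's
    # Harvest estimator; EL speech typically shows a much smaller
    # standard deviation (a flatter contour). File names are hypothetical.
    import librosa
    import numpy as np
    import pyworld

    def f0_stats(path, sr=16000):
        wav, _ = librosa.load(path, sr=sr)
        f0, _ = pyworld.harvest(wav.astype(np.float64), sr)
        voiced = f0[f0 > 0]  # Harvest marks unvoiced frames with 0 Hz
        return float(np.mean(voiced)), float(np.std(voiced))

    for name in ["el_speech.wav", "normal_speech.wav"]:
        mean_f0, std_f0 = f0_stats(name)
        print(f"{name}: mean F0 = {mean_f0:.1f} Hz, std = {std_f0:.1f} Hz")

Audio examples are provided below.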


[Audio example table omitted. For sentence #J01, the page provides the input EL speech, the target recordings of speakers fkn, fks, mho, and mht, and conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]


Whisper-to-normal speech conversion

Here are some audio examples of whisper-to-normal speech conversion. Whispered speech can be useful in quiet, private online communication, where the speaker does not want his or her voice to be heard by nearby listeners, but it suffers from lower intelligibility than normal speech. The aim of this task is to convert whispered speech into a normal version so that it becomes easier to hear. As in the electrolaryngeal speech enhancement task, the input speech carries no pitch information, so the challenge is again how to predict natural intonation from unpitched speech. Audio examples are provided below.


[Audio example table omitted. For sentence #j01, the page provides the input whispered speech, the target normal speech, and conversions by ACVAE-VC (CNN and RNN) combined with the HFG and PWG vocoders.]

References

[1] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, Sep. 2019.

[2] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020, pp. 17022–17033.

[3] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.

[4] https://github.com/kan-bayashi/ParallelWaveGAN

[5] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223–224.