ACVAE-VC

Papers
  • Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, "ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder," arXiv:1808.05092 [stat.ML], Aug. 2018.

ACVAE-VC is a non-parallel many-to-many voice conversion (VC) method built on a variant of the conditional variational autoencoder (CVAE) called an auxiliary classifier VAE (ACVAE). The method has three key features. First, it adopts fully convolutional architectures for the encoder and decoder networks so that they can learn conversion rules that capture time dependencies in the acoustic feature sequences of the source and target speech. Second, to enhance the conversion effect, it applies an information-theoretic regularization during model training to ensure that the information carried by the attribute class label is not lost in the conversion process. With regular CVAEs, the encoder and decoder are free to ignore the attribute class label input; this is problematic because the label would then have little effect on controlling the voice characteristics of the input speech at test time. Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. Third, it avoids producing buzzy-sounding speech at test time by simply transplanting the spectral details of the input speech into its converted version. We found that this simple approach works reasonably well, as demonstrated below.
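
To make the role of the auxiliary classifier concrete, below is a minimal PyTorch-style sketch of a CVAE loss augmented with a classifier term. It is written under our own assumptions: the module names (Encoder, Decoder, AuxClassifier), the 1-D convolutional layer sizes, the Gaussian reconstruction term, and the weight lambda_cls are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch (PyTorch) of a CVAE loss augmented with an auxiliary-classifier
# regularizer, in the spirit of ACVAE-VC.  Shapes, layer sizes, and loss weights
# are illustrative assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Fully convolutional encoder q(z | x, y): feature sequence + class label -> latent sequence."""
    def __init__(self, feat_dim=36, n_classes=4, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim + n_classes, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 2 * z_dim, kernel_size=5, padding=2),  # -> (mean, log-variance)
        )
    def forward(self, x, y_onehot):
        h = torch.cat([x, y_onehot.unsqueeze(-1).expand(-1, -1, x.size(-1))], dim=1)
        mean, logvar = self.net(h).chunk(2, dim=1)
        return mean, logvar

class Decoder(nn.Module):
    """Fully convolutional decoder p(x | z, y): latent sequence + class label -> feature sequence."""
    def __init__(self, feat_dim=36, n_classes=4, z_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(z_dim + n_classes, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=5, padding=2),
        )
    def forward(self, z, y_onehot):
        h = torch.cat([z, y_onehot.unsqueeze(-1).expand(-1, -1, z.size(-1))], dim=1)
        return self.net(h)

class AuxClassifier(nn.Module):
    """Predicts the attribute (speaker) class from an acoustic feature sequence."""
    def __init__(self, feat_dim=36, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_classes),
        )
    def forward(self, x):
        return self.net(x)  # class logits

def acvae_loss(enc, dec, cls, x, y_onehot, y_index, lambda_cls=1.0):
    """CVAE ELBO terms plus the auxiliary-classifier regularizer.

    The classifier term encourages the decoder output to retain the information
    of the class label it was conditioned on, so that the label is not ignored.
    """
    mean, logvar = enc(x, y_onehot)
    z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)       # reparameterization trick
    x_hat = dec(z, y_onehot)

    recon = F.mse_loss(x_hat, x)                                      # reconstruction term (Gaussian likelihood up to constants)
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())   # KL(q(z|x,y) || N(0, I))
    cls_loss = F.cross_entropy(cls(x_hat), y_index)                   # classifier must recover y from the decoder output

    return recon + kl + lambda_cls * cls_loss
```

Training the classifier itself on real labelled utterances, and applying it to outputs decoded with labels other than the source speaker's, are left out of this sketch for brevity.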

Figure 1. Illustration of ACVAE-VC.
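
The third feature, transplanting spectral details at test time, can be illustrated with the short hedged sketch below. It reuses the hypothetical enc and dec modules from the previous sketch and adds the residual between the input features and their own reconstruction onto the converted features; this residual-transfer reading of "transplanting the spectral details" is our assumption, not necessarily the exact procedure in the paper.

```python
# Hedged sketch of test-time conversion with detail transplantation.
# `enc` and `dec` are trained Encoder/Decoder modules as in the previous sketch;
# adding the input's reconstruction residual back onto the converted features is
# our interpretation of "transplanting the spectral details".
import torch

@torch.no_grad()
def convert(enc, dec, x, y_src_onehot, y_trg_onehot):
    """x: (batch, feat_dim, frames) acoustic features of the source utterance."""
    mean, _ = enc(x, y_src_onehot)       # posterior mean as the latent sequence
    x_recon = dec(mean, y_src_onehot)    # reconstruction under the source label
    x_conv = dec(mean, y_trg_onehot)     # conversion under the target label
    detail = x - x_recon                 # fine structure the coarse model misses
    return x_conv + detail               # transplant that detail into the output
```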

Audio examples

Here, we present audio examples of ACVAE-VC tested on a non-parallel many-to-many speaker identity conversion task. We selected the speech of two female speakers, 'SF1' and 'SF2', and two male speakers, 'SM1' and 'SM2', from the Voice Conversion Challenge (VCC) 2018 dataset [1] for training and evaluation, giving twelve different source-target speaker combinations in total. The audio files for each speaker were manually segmented into 116 short sentences (about 7 minutes), of which 81 and 35 sentences (about 5 and 2 minutes) were used as the training and evaluation sets, respectively. Subjective evaluation experiments revealed that ACVAE-VC obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding Wasserstein GANs (VAW-GAN) [2]. Audio examples of the VAW-GAN approach are available on the authors' website [3].
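
As a quick summary of this setup, the snippet below encodes the speaker IDs, sentence counts, and the number of conversion directions stated above; the variable names and the use of itertools are purely illustrative.

```python
# Data setup from the text: four VCC 2018 speakers, 81 training and
# 35 evaluation sentences each.  Only the IDs and counts come from the text;
# everything else here is an illustrative assumption.
from itertools import permutations

SPEAKERS = ["SF1", "SF2", "SM1", "SM2"]   # two female, two male speakers
N_TRAIN, N_EVAL = 81, 35                  # sentences per speaker (~5 min / ~2 min)

# All ordered (source, target) pairs: 4 * 3 = 12 conversion directions.
CONVERSION_PAIRS = list(permutations(SPEAKERS, 2))
assert len(CONVERSION_PAIRS) == 12
```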

SM1 (Male) → SF1 (Female)
  Source speech example | Target speech example
  Input speech | Converted speech (ACVAE-VC)

SF2 (Female) → SM2 (Male)
  Source speech example | Target speech example
  Input speech | Converted speech (ACVAE-VC)

SM2 (Male) → SM1 (Male)
  Source speech example | Target speech example
  Input speech | Converted speech (ACVAE-VC)

SF1 (Female) → SF2 (Female)
  Source speech example | Target speech example
  Input speech | Converted speech (ACVAE-VC)
References

[1] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," arXiv:1804.04262 [eess.AS], Apr. 2018.

[2] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 3364–3368.

[3] https://jeremycchsu.github.io/vc-vawgan/