Papers
  • Hirokazu Kameoka, Kou Tanaka, and Takuhiro Kaneko, "FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion," arXiv:2104.06900 [cs.SD], 2021. (PDF)
  • Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo, "ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1849-1863, Jun. 2020. (IEEE Xplore)
ConvS2S-VC (melspec)

Here we demonstrate audio examples generated by ConvS2S-VC [1] applied to mel-spectrograms (instead of mel-cepstral sequences). After mel-spectrogram conversion, Parallel WaveGAN [2][3] is used to generate waveforms from the converted mel-spectrograms.
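
Below is a minimal Python sketch of the pipeline described above (waveform to mel-spectrogram, mel-spectrogram conversion, Parallel WaveGAN vocoding), intended only as an illustration. The ConvS2S-VC model is stood in for by a hypothetical convs2s_vc callable, the file and checkpoint names are placeholders, the feature settings are typical values rather than the exact ones used here, and the vocoder calls assume the kan-bayashi/ParallelWaveGAN package [3].

import numpy as np
import librosa
import soundfile as sf
import torch
from parallel_wavegan.utils import load_model  # vocoder utilities from [3]

def wav_to_logmel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Extract a log mel-spectrogram with shape (frames, n_mels).
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-10)).T

# Hypothetical trained ConvS2S-VC converter (e.g. a torch.nn.Module restored from a
# checkpoint) mapping a source mel-spectrogram to the target speaker's mel-spectrogram.
convs2s_vc = ...

src_mel = wav_to_logmel("source_utterance.wav")          # placeholder input file
cvt_mel = convs2s_vc(torch.from_numpy(src_mel).float())  # converted mel-spectrogram

# Synthesize the waveform from the converted mel-spectrogram with Parallel WaveGAN.
# In practice the mel features are normalized with the statistics used at training time.
vocoder = load_model("pwg_checkpoint.pkl")               # placeholder checkpoint path
vocoder.remove_weight_norm()
vocoder.eval()
with torch.no_grad():
    wav = vocoder.inference(cvt_mel).view(-1).cpu().numpy()
sf.write("converted.wav", wav, 22050)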

[Figure: S2S-VC architecture]
[Figure: (a) RNN, (b) ConvS2S, and (c) Transformer architectures]
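
As a rough illustration of the building block behind architecture (b), here is a minimal PyTorch sketch of a gated 1D-convolution layer with a residual connection, of the kind stacked in fully convolutional seq2seq models such as ConvS2S. It is not the exact architecture used in [1]; the channel size and kernel width are arbitrary.

import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    # One gated convolution block: Conv1d producing 2*channels, halved by a
    # gated linear unit (GLU), followed by a residual connection.
    def __init__(self, channels=256, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                    # x: (batch, channels, frames)
        h = nn.functional.glu(self.conv(x), dim=1)
        return x + h

x = torch.randn(1, 256, 100)                 # e.g. 100 spectrogram frames
print(GLUConvBlock()(x).shape)               # torch.Size([1, 256, 100])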
Audio examples
Speaker identity conversion

Here are some audio examples of speaker identity conversion using the CMU Arctic database [4], which contains 1132 phonetically balanced English utterances for each of two female speakers ("clb" and "slt") and two male speakers ("bdl" and "rms"). The audio files of each speaker were manually divided into sets of 1000 and 132 files, and the first set was used as the training set for ConvS2S-VC and Parallel WaveGAN. Audio examples obtained with ConvS2S-VC applied to mel-spectrograms, ConvS2S-VC applied to mel-cepstral sequences, and the open-source VC system "sprocket" [5] are presented below.
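
The per-speaker split described above is straightforward to reproduce; the following sketch assumes a standard CMU Arctic directory layout, which is an illustrative assumption rather than the exact setup used in the experiments.

import os
from glob import glob

speakers = ["clb", "bdl", "slt", "rms"]
splits = {}
for spk in speakers:
    # Placeholder path pattern for the per-speaker Arctic wav directories.
    files = sorted(glob(os.path.join("cmu_arctic", f"cmu_us_{spk}_arctic", "wav", "*.wav")))
    splits[spk] = {"train": files[:1000],     # first 1000 utterances for training
                   "held_out": files[1000:]}  # remaining 132 utterances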


ConvS2S-VC (melspec): audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.

ConvS2S-VC (melcep) [1]: audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.

sprocket [5]: audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.



Emotional expression conversion

Here are some audio examples of emotional expression conversion. All the training and test data used in this experiment consisted of Japanese utterances spoken by a single female speaker with neutral (ntl), angry (ang), sad (sad), and happy (hap) expressions. Audio examples obtained with ConvS2S-VC applied to mel-spectrograms, ConvS2S-VC applied to mel-cepstral sequences, and sprocket [5] are provided below.

ConvS2S-VC (melspec): audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.

ConvS2S-VC (melcep) [1]: audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.

sprocket [5]: audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.
References

[1] H. Kameoka, K. Tanaka, D. Kwasny, T. Kaneko, and N. Hojo, "ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion," IEEE/ACM Trans. ASLP, vol. 28, pp. 1849–1863, Jun. 2020.

[2] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.

[3] https://github.com/kan-bayashi/ParallelWaveGAN

[4] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223–224.

[5] K. Kobayashi and T. Toda, "sprocket: Open-source voice conversion software," in Proc. Odyssey, 2018, pp. 203–210.