Papers
  • Hirokazu Kameoka, Kou Tanaka, and Takuhiro Kaneko, "FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion," arXiv:2104.06900 [cs.SD], 2021. (PDF)
  • Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo, "ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1849-1863, Jun. 2020. (IEEE Xplore)
ConvS2S-VC (melspec)

Here we demonstrate audio examples generated by ConvS2S-VC [1] applied to mel-spectrograms (instead of mel-cepstral sequences). After mel-spectrogram conversion, Parallel WaveGAN [2][3] is used to generate waveforms from the converted mel-spectrograms.
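
Below is a minimal Python sketch of the pipeline described above (waveform to mel-spectrogram, mel-spectrogram conversion, Parallel WaveGAN vocoding), intended only as an illustration. The ConvS2S-VC model is stood in for by a hypothetical convs2s_vc callable, the file and checkpoint names are placeholders, the feature settings are typical values rather than the exact ones used here, and the vocoder calls assume the kan-bayashi/ParallelWaveGAN package [3].

import numpy as np
import librosa
import soundfile as sf
import torch
from parallel_wavegan.utils import load_model  # vocoder utilities from [3]

def wav_to_logmel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Extract a log mel-spectrogram with shape (frames, n_mels).
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-10)).T

# Hypothetical trained ConvS2S-VC converter (e.g. a torch.nn.Module restored from a
# checkpoint) mapping a source mel-spectrogram to the target speaker's mel-spectrogram.
convs2s_vc = ...

src_mel = wav_to_logmel("source_utterance.wav")          # placeholder input file
cvt_mel = convs2s_vc(torch.from_numpy(src_mel).float())  # converted mel-spectrogram

# Synthesize the waveform from the converted mel-spectrogram with Parallel WaveGAN.
# In practice the mel features are normalized with the statistics used at training time.
vocoder = load_model("pwg_checkpoint.pkl")               # placeholder checkpoint path
vocoder.remove_weight_norm()
vocoder.eval()
with torch.no_grad():
    wav = vocoder.inference(cvt_mel).view(-1).cpu().numpy()
sf.write("converted.wav", wav, 22050)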

[Figure: S2S-VC architecture]
[Figure: (a) RNN, (b) ConvS2S, and (c) Transformer architectures]
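
As a rough illustration of the building block behind architecture (b), here is a minimal PyTorch sketch of a gated 1D-convolution layer with a residual connection, of the kind stacked in fully convolutional seq2seq models such as ConvS2S. It is not the exact architecture used in [1]; the channel size and kernel width are arbitrary.

import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    # One gated convolution block: Conv1d producing 2*channels, halved by a
    # gated linear unit (GLU), followed by a residual connection.
    def __init__(self, channels=256, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                    # x: (batch, channels, frames)
        h = nn.functional.glu(self.conv(x), dim=1)
        return x + h

x = torch.randn(1, 256, 100)                 # e.g. 100 spectrogram frames
print(GLUConvBlock()(x).shape)               # torch.Size([1, 256, 100])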
Audio examples
Speaker identity conversion

Here are some audio examples of speaker identity conversion using the CMU Arctic database [4], which contains 1132 phonetically balanced English utterances for each of two female speakers ("clb" and "slt") and two male speakers ("bdl" and "rms"). The audio files of each speaker were manually divided into sets of 1000 and 132 files, and the first set was used as the training set for ConvS2S-VC and Parallel WaveGAN. Audio examples obtained with ConvS2S-VC applied to mel-spectrograms, ConvS2S-VC applied to mel-cepstral sequences, and the open-source VC system "sprocket" [5] are presented below.
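
The per-speaker split described above is straightforward to reproduce; the following sketch assumes a standard CMU Arctic directory layout, which is an illustrative assumption rather than the exact setup used in the experiments.

import os
from glob import glob

speakers = ["clb", "bdl", "slt", "rms"]
splits = {}
for spk in speakers:
    # Placeholder path pattern for the per-speaker Arctic wav directories.
    files = sorted(glob(os.path.join("cmu_arctic", f"cmu_us_{spk}_arctic", "wav", "*.wav")))
    splits[spk] = {"train": files[:1000],     # first 1000 utterances for training
                   "held_out": files[1000:]}  # remaining 132 utterances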


ConvS2S-VC (melspec): audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.

ConvS2S-VC (melcep) [1]: audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.

sprocket [5]: audio samples for every input-to-target speaker pair among clb, bdl, slt, and rms.



Emotional expression conversion

Here are some audio examples of emotional expression conversion. All the training and test data used in this experiment consisted of Japanese utterances spoken by a single female speaker with neutral (ntl), angry (ang), sad (sad), and happy (hap) expressions. Audio examples obtained with ConvS2S-VC applied to mel-spectrograms, ConvS2S-VC applied to mel-cepstral sequences, and sprocket [5] are provided below.

ConvS2S-VC (melspec): audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.

ConvS2S-VC (melcep) [1]: audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.

sprocket [5]: audio samples for every input-to-target expression pair among ntl, ang, sad, and hap.
References

[1] H. Kameoka, K. Tanaka, D. Kwasny, T. Kaneko, and N. Hojo, "ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion," IEEE/ACM Trans. ASLP, vol. 28, pp. 1849–1863, Jun. 2020.

[2] R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.

[3] https://github.com/kan-bayashi/ParallelWaveGAN

[4] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. SSW, 2004, pp. 223–224.

[5] K. Kobayashi and T. Toda, "sprocket: Open-source voice conversion software," in Proc. Odyssey, 2018, pp. 203–210.