Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo,
"ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion,"
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1849-1863, Jun. 2020.
(IEEE Xplore)
ConvS2S-VC
ConvS2S-VC is a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning. It learns the mapping between source and target speech feature sequences using a fully convolutional seq2seq model with an attention mechanism. Owing to the nature of seq2seq learning, the method can flexibly convert not only the voice characteristics but also the pitch contour and duration of the input speech. The current model consists of five networks, namely source and target encoders, a target decoder, and source and target reconstructors, all of which are built from dilated causal convolution layers with gated linear units.
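To make the building block concrete, below is a minimal PyTorch sketch of a dilated causal convolution layer with a gated linear unit (GLU), the kind of layer the five networks are composed of. It is an illustration under assumed settings (channel count, kernel size, residual connection), not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CausalGLUBlock(nn.Module):
    """One dilated causal convolution followed by a GLU (illustrative sketch)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # Left-pad so the convolution never looks at future frames (causal).
        self.pad = (kernel_size - 1) * dilation
        # Twice the channels: one half carries content, the other half the gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time)
        y = self.conv(nn.functional.pad(x, (self.pad, 0)))
        a, b = y.chunk(2, dim=1)
        return x + a * torch.sigmoid(b)   # GLU output with a residual connection

# Example: stack blocks with exponentially growing dilation to widen the receptive field.
encoder = nn.Sequential(*[CausalGLUBlock(64, dilation=2 ** i) for i in range(4)])
out = encoder(torch.randn(1, 64, 100))    # -> (1, 64, 100)
```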
Figure 1. Model structure of ConvS2S-VC.
Here are some audio examples of ConvS2S-VC tested on a speaker identity conversion task.
In this experiment, we used the CMU Arctic database [1], which consists of
1132 phonetically balanced English utterances spoken by four US English speakers.
We selected "clb" (female) and "rms" (male) as the source speakers and "slt" (female) and
"bdl" (male) as the target speakers. The audio files for each speaker were manually divided
into 1000 and 132 files, which were provided as training and evaluation sets, respectively.
For each utterance, 28 mel-cepstral coefficients (MCEPs), log F0, aperiodicity, and
voiced/unvoiced information were extracted every 8 ms using the WORLD analyzer.
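The sketch below shows how feature sequences of this kind can be extracted with the pyworld and pysptk packages. The file name, FFT settings, and the all-pass constant alpha are assumptions for illustration, not values reported for the experiment.

```python
import numpy as np
import pyworld
import pysptk
import soundfile as sf

# Load a mono waveform (hypothetical file name) as float64, which pyworld expects.
x, fs = sf.read("arctic_a0001.wav")
x = x.astype(np.float64)

frame_period = 8.0                              # analysis frame shift of 8 ms
f0, t = pyworld.harvest(x, fs, frame_period=frame_period)
sp = pyworld.cheaptrick(x, f0, t, fs)           # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                  # aperiodicity

mcep = pysptk.sp2mc(sp, order=27, alpha=0.42)   # 28 mel-cepstral coefficients per frame
vuv = (f0 > 0).astype(np.float64)               # voiced/unvoiced flag
log_f0 = np.zeros_like(f0)
log_f0[f0 > 0] = np.log(f0[f0 > 0])             # log F0 on voiced frames only
```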
Audio examples obtained with ConvS2S-VC, its RNN counterpart, and the open-source VC system
called "sprocket" [2] are demonstrated below.
Here are some audio examples of ConvS2S-VC tested on an emotional expression conversion task.
All the training and test data used in this experiment consisted of Japanese utterances spoken
by one female speaker with neutral (ntl), angry (ang), sad (sad), and happy (hap) expressions.
Audio examples obtained with sprocket [2] are also provided below.
As these examples show, ConvS2S-VC was able to convert the expressions more naturally than sprocket.
Here are some audio examples of ConvS2S-VC tested on an electrolaryngeal-to-normal speech
conversion task.
Loss of voice after a laryngectomy can lead to a considerable decrease in the quality of life.
Electrolaryngeal (EL) speech is a method of voice restoration that uses an electrolarynx,
a battery-operated device that produces sound to serve as a substitute voice.
Here, we used ConvS2S-VC to perform EL speech enhancement with the aim of improving
the perceived naturalness of EL speech.
Audio examples obtained with sprocket [2] are also provided below.
Since EL speech has a nearly flat pitch contour and sprocket is designed to adjust only the mean
and variance of the log F0 contour of the input speech, the speech it generates also has a flat pitch contour.
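For reference, this is what such a mean/variance transformation of log F0 looks like; the statistics passed in are illustrative placeholders, not values from the experiments. Because the mapping is linear in the log domain, the shape of the input contour is preserved, which is why a flat input contour stays flat.

```python
import numpy as np

def convert_log_f0(f0_src, src_mean, src_std, trg_mean, trg_std):
    """Map the source log-F0 mean and variance onto the target's (sketch)."""
    f0_conv = np.zeros_like(f0_src)
    voiced = f0_src > 0                          # leave unvoiced frames at zero
    log_f0 = np.log(f0_src[voiced])
    f0_conv[voiced] = np.exp((log_f0 - src_mean) / src_std * trg_std + trg_mean)
    return f0_conv
```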
Suprasegmental patterns in speech are important factors that
characterize accents in diverse languages.
Owing to the nature of seq2seq learning, ConvS2S-VC is able to capture and convert
the suprasegmental features contained in input speech.
Here are some audio examples of ConvS2S-VC tested on an English accent conversion task
where the voice of an Indian English speaker ("ksp") is converted to that of two American
English speakers ("slt" and "rms") included in the CMU Arctic database [1].
Audio examples obtained with sprocket [2], which is designed to convert
segmental features only, are also provided.
As these examples show, ConvS2S-VC was able to convert the accent of the Indian
English speaker more successfully than sprocket.