Paper

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

English audio samples

Systems

GMM-VC-wGV

Conventional GMM-based voice conversion considering global variance [1]

LSTM-TTS

LSTM-based text-to-speech synthesis
(used in place of the conventional Seq2Seq voice conversion using context information [2])

Bonus

Our method applied to mel-spectrograms rather than WORLD vocoder features

Anasyn1

Analysis and synthesis via WORLD vocoder features only, i.e., no conversion

Anasyn2

Analysis and synthesis via mel-spectrograms only, i.e., no conversion

Database

CMU Arctic Databases [3]

Training: 1000 sentences
Evaluation: 132 sentences

Cross-gender: clb (female) --> bdl (male)
(Supported: Safari, Chrome, Firefox, Opera)
Natural: Source, Target
WORLD vocoder [4]: Proposed, LSTM-TTS, GMM-VC-wGV, Anasyn1
Phase vocoder [5]: Bonus, Anasyn2

Intra-gender: clb (female) --> slt (female)

Intra-gender: rms (male) --> bdl (male)

Cross-gender: rms (male) --> slt (female)

Acceleration and Stabilization of Training Procedure

We varied the hyperparameters \(\lambda_\mathrm{ga}\), \(\lambda_\mathrm{cpx}\), and \(\lambda_\mathrm{cpy}\) of our objective function:
\({\mathcal L}_\mathrm{proposed} = {\mathcal L}_\mathrm{Seq2Seq} + \lambda_\mathrm{ga} {\mathcal L}_\mathrm{ga} + \lambda_\mathrm{cpx} \| \tilde{X} - X \|_1 + \lambda_\mathrm{cpy} \| \tilde{Y} - Y \|_1\)
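
As a rough illustration of how the weighted terms combine, the objective can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `l1_distance` and `proposed_loss` are hypothetical, the L1 norm is taken as a plain sum of absolute differences (the actual normalization may differ), and a weight shown as "--" in the tables is assumed to mean that the corresponding term is switched off (passed here as `None`).

```python
def l1_distance(a, b):
    """Sum of absolute differences between two feature sequences.

    Both inputs are lists of frames, each frame a list of floats.
    Assumption: unnormalized L1 norm (no averaging over frames or dims).
    """
    return sum(
        abs(p - q)
        for frame_a, frame_b in zip(a, b)
        for p, q in zip(frame_a, frame_b)
    )


def proposed_loss(l_seq2seq, l_ga, x_tilde, x, y_tilde, y,
                  lam_ga=None, lam_cpx=None, lam_cpy=None):
    """Combine the Seq2Seq loss with the optional weighted terms.

    l_seq2seq and l_ga are precomputed scalar losses.
    A weight of None (assumed meaning of "--") disables that term.
    """
    total = l_seq2seq
    if lam_ga is not None:
        total += lam_ga * l_ga                       # guided attention term
    if lam_cpx is not None:
        total += lam_cpx * l1_distance(x_tilde, x)   # source reconstruction term
    if lam_cpy is not None:
        total += lam_cpy * l1_distance(y_tilde, y)   # target prediction term
    return total


# Toy values, chosen only to exercise the function.
X = [[1.0, 2.0], [3.0, 4.0]]
X_tilde = [[1.5, 2.0], [3.0, 3.0]]
Y = [[0.0]]
Y_tilde = [[2.0]]

loss_all = proposed_loss(0.25, 0.001, X_tilde, X, Y_tilde, Y,
                         lam_ga=10000, lam_cpx=10, lam_cpy=10)
loss_plain = proposed_loss(0.25, 0.001, X_tilde, X, Y_tilde, Y)
```

With every weight left as `None`, the function returns the Seq2Seq loss unchanged, which corresponds to the "-- -- --" setting.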

Source speech:
Target speech:

Conversion results after 1,000 epochs of training
\(\lambda_\mathrm{ga}\)	\(\lambda_\mathrm{cpx}\)	\(\lambda_\mathrm{cpy}\)	Attention	Converted
--	--	--
1	--	--
1 (Failed)	--	--
10,000	10	10

Additional results
--	10	--
--	--	10
--	10	10
10,000	--	--
10,000	10	--
10,000	--	10

References

[1] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," in IEEE Transactions on ASLP, 2007.

[2] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities," in Proc. INTERSPEECH, Aug. 2017.

[3] John Kominek and Alan W. Black, "The CMU Arctic Speech Databases," in Proc. 5th ISCA Speech Synthesis Workshop (SSW5), June 2004.

[4] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," in IEICE, 2016.

[5] Daniel W. Griffin and Jae S. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," in IEEE Transactions on ASSP, 1984.