AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo


English audio samples

GMM-VC-wGV: Conventional GMM-based voice conversion considering global variance [1]
LSTM-TTS: LSTM-based text-to-speech synthesis
(used instead of conventional Seq2Seq voice conversion using context information [2])
Bonus: Our method applied to mel-spectrograms rather than WORLD vocoder features
Anasyn1: Analysis and synthesis only via WORLD vocoder features (no conversion)
Anasyn2: Analysis and synthesis only via mel-spectrograms (no conversion)
Dataset: CMU Arctic Databases [3]
Training: 1000 sentences
Evaluation: 132 sentences

Cross-gender: clb (female) --> bdl (male)
(Supported browsers: Safari, Chrome, Firefox, Opera)
Natural: Source, Target
WORLD vocoder [4]: Proposed, LSTM-TTS, GMM-VC-wGV, Anasyn1
Phase vocoder [5]: Bonus, Anasyn2

Intra-gender: clb (female) --> slt (female)

Intra-gender: rms (male) --> bdl (male)

Cross-gender: rms (male) --> slt (female)

Acceleration and Stabilization of Training Procedure

We varied the hyperparameters \(\lambda_\mathrm{ga}\), \(\lambda_\mathrm{cpx}\), and \(\lambda_\mathrm{cpy}\) of our objective function:
\({\mathcal L}_\mathrm{proposed} = {\mathcal L}_\mathrm{Seq2Seq} + \lambda_\mathrm{ga} {\mathcal L}_\mathrm{ga} + \lambda_\mathrm{cpx} \| \tilde{X} - X \|_1 + \lambda_\mathrm{cpy} \| \tilde{Y} - Y \|_1\)
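
As a rough illustration of how the terms of this objective combine, the following Python sketch computes the weighted sum from already-computed loss values and feature sequences. It is not the authors' implementation: the function names, the list-of-frames representation, and the reading of "--" in the tables below as a weight of zero are all assumptions made for this example. The Seq2Seq loss and the attention loss \({\mathcal L}_\mathrm{ga}\) are taken as given numbers, since their definitions are not stated on this page.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two feature sequences.

    Each sequence is a list of frames, and each frame is a list of floats
    (a hypothetical representation chosen for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(a, b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same dimensionality")
        total += sum(abs(u - v) for u, v in zip(frame_a, frame_b))
    return total


def proposed_loss(l_seq2seq, l_ga, x_rec, x, y_rec, y,
                  lambda_ga=0.0, lambda_cpx=0.0, lambda_cpy=0.0):
    """Weighted sum of the four terms of L_proposed.

    l_seq2seq and l_ga are precomputed scalar losses. x_rec / y_rec stand
    for the reconstructions written as X-tilde and Y-tilde in the formula.
    A weight of 0.0 switches the corresponding term off (assumed here to
    correspond to "--" in the tables).
    """
    return (l_seq2seq
            + lambda_ga * l_ga
            + lambda_cpx * l1_distance(x_rec, x)
            + lambda_cpy * l1_distance(y_rec, y))
```

Calling proposed_loss with lambda_ga=10000, lambda_cpx=10, and lambda_cpy=10 corresponds to the fully weighted setting listed in the results below, while leaving all three weights at zero reduces the objective to the Seq2Seq loss alone.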

Source speech:
Target speech:

Conversion results after 1,000 epochs of training
\(\lambda_\mathrm{ga}\) \(\lambda_\mathrm{cpx}\) \(\lambda_\mathrm{cpy}\) Attention Converted
-- -- --
-- --
-- --
10,000 10 10

Additional results
-- 10 --
-- -- 10
-- 10 10
10,000 -- --
10,000 10 --
10,000 -- 10


  1. Tomoki Toda, Alan W. Black, and Keiichi Tokuda, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," in IEEE Transactions on ASLP, 2007.
  2. Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari, "Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities," in Proc. INTERSPEECH, Aug. 2017.
  3. John Kominek and Alan W. Black, "The CMU Arctic Speech Databases," in Proc. 5th ISCA Speech Synthesis Workshop (SSW5), June 2004.
  4. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," in IEICE Transactions on Information and Systems, 2016.
  5. Daniel W. Griffin and Jae S. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," in IEEE Transactions on ASSP, 1984.