AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo


English audio samples

Conventional GMM-based voice conversion considering global variance [1]
LSTM-based text-to-speech synthesis (used in place of conventional Seq2Seq voice conversion using context information [2])
Our method applied to mel-spectrograms rather than WORLD vocoder features
Analysis and synthesis only, via WORLD vocoder features (no conversion)
Analysis and synthesis only, via mel-spectrograms (no conversion)
CMU Arctic Databases [3]
Training: 1000 sentences
Evaluation: 132 sentences

Cross gender: clb (female) --> bdl (male)
Natural WORLD vocoder [4] Phase vocoder [5]
Source Target Proposed LSTM-TTS GMM-VC-wGV Anasyn1 Bonus Anasyn2

Intra gender: clb (female) --> slt (female)

Intra gender: rms (male) --> bdl (male)

Cross gender: rms (male) --> slt (female)

Acceleration and Stabilization of Training Procedure

We varied the hyperparameters \(\lambda_\mathrm{ga}\), \(\lambda_\mathrm{cpx}\), and \(\lambda_\mathrm{cpy}\) of our objective function:
\({\mathcal L}_\mathrm{proposed} = {\mathcal L}_\mathrm{Seq2Seq} + \lambda_\mathrm{ga} {\mathcal L}_\mathrm{ga} + \lambda_\mathrm{cpx} \| \tilde{X} - X \|_1 + \lambda_\mathrm{cpy} \| \tilde{Y} - Y \|_1\)
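
As a minimal sketch of how the weighted terms combine, the snippet below evaluates the objective with NumPy. It assumes that the Seq2Seq loss and the guided attention loss are already computed as scalars (their definitions are not given on this page), that the L1 terms are entrywise sums over the feature arrays, and that a "--" entry in the tables below means the corresponding term is switched off (weight 0). The function and argument names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def l1_norm(a, b):
    """Entrywise L1 norm ||a - b||_1 between two feature arrays."""
    return float(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).sum())


def proposed_loss(l_seq2seq, l_ga, x_rec, x, y_rec, y,
                  lam_ga=10000.0, lam_cpx=10.0, lam_cpy=10.0):
    """Combine the four terms of the objective.

    l_seq2seq, l_ga : precomputed scalar losses (assumed given).
    x_rec, x        : reconstructed and original source features.
    y_rec, y        : reconstructed and original target features.
    lam_*           : weights; 0.0 disables a term (assumed meaning of "--").
    """
    return (l_seq2seq
            + lam_ga * l_ga
            + lam_cpx * l1_norm(x_rec, x)
            + lam_cpy * l1_norm(y_rec, y))


if __name__ == "__main__":
    # Toy arrays standing in for acoustic feature sequences (frames x dims).
    x, x_rec = np.zeros((2, 3)), np.ones((2, 3))
    y, y_rec = np.zeros((2, 2)), 0.5 * np.ones((2, 2))
    full = proposed_loss(1.0, 0.001, x_rec, x, y_rec, y)
    # Ablated setting: all three extra terms switched off.
    base = proposed_loss(1.0, 0.001, x_rec, x, y_rec, y,
                         lam_ga=0.0, lam_cpx=0.0, lam_cpy=0.0)
    print(full, base)
```

The default weights mirror the 10,000 / 10 / 10 setting listed in the results table. Practical implementations often average the L1 terms over frames rather than summing them, which rescales the effective weights.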

Source speech:
Target speech:

Conversion results after 1,000 epochs of training
\(\lambda_\mathrm{ga}\) \(\lambda_\mathrm{cpx}\) \(\lambda_\mathrm{cpy}\) Attention Converted
-- -- --
-- --
-- --
10,000 10 10

Additional results
-- 10 --
-- -- 10
-- 10 10
10,000 -- --
10,000 10 --
10,000 -- 10


