Paper

WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

 PDF

Japanese audio samples (reported in the paper)

Due to licensing restrictions, we do not have permission to publish the audio samples.
(We plan to train models on an alternative Japanese database whose license allows publication.)

English audio samples

Systems
  Merlin: DNN-based TTS [1]
  V1: Conventional WaveCycleGAN [2] (applied to Merlin's results)
  V2msp: WaveCycleGAN2, i.e., the improved WaveCycleGAN (applied to Merlin's results)
  WORLD: Analysis and synthesis only, via parametric vocoder features [3] (25-dimensional mel-cepstrum, F0, and coded aperiodicity); see the sketch after this list
  Bonus: WaveCycleGAN2 (applied to WORLD's results)
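As an illustration of the WORLD analysis-synthesis entry above, the following is a minimal round-trip sketch. It assumes the pyworld Python bindings of WORLD and the soundfile package, with hypothetical file names and default analysis settings; the 25-dimensional mel-cepstrum and coded-aperiodicity compression used for the samples on this page is omitted.

    # WORLD analysis-synthesis round trip (illustrative sketch only).
    import numpy as np
    import pyworld
    import soundfile as sf

    x, fs = sf.read("arctic_a0001.wav")            # hypothetical 16 kHz input
    x = np.ascontiguousarray(x, dtype=np.float64)  # WORLD expects float64

    f0, t = pyworld.harvest(x, fs)                 # F0 contour and frame times
    sp = pyworld.cheaptrick(x, f0, t, fs)          # spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)                 # band aperiodicity

    y = pyworld.synthesize(f0, sp, ap, fs)         # resynthesized waveform
    sf.write("arctic_a0001_anasyn.wav", y, fs)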
Database
  16 kHz sampling: CMU Arctic Databases [4]
Training: 1000 sentences
Evaluation: 132 sentences

Male speaker: bdl
[Audio samples (TTS / AnaSyn): Natural, Merlin, V1, V2msp, WORLD, Bonus]

Female speaker: slt
[Audio samples (TTS / AnaSyn): Natural, Merlin, V1, V2msp, WORLD, Bonus]

Analysis-and-Synthesis of English audio samples

Systems
  WORLD: Parametric vocoder analysis-synthesis [3] (34-dimensional mel-cepstrum, F0, and coded aperiodicity)
  GL: Phase reconstruction with the Griffin-Lim method [5]; see the sketch after this list
  WaveNet: Open-source WaveNet implementation [6] with a mixture-of-logistics output (not the official WaveNet [7]; audio samples are taken from the WaveGlow authors' public sample folder [8])
  WaveGlow: Official WaveGlow [8] (audio samples are taken from the authors' public sample folder [8])
  V2msp: WaveCycleGAN2 trained under the parallel-data condition (applied to WORLD's results)
  V2msp': WaveCycleGAN2 trained under the non-parallel-data condition (applied to WORLD's results)
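The GL entry is magnitude-only reconstruction with the Griffin-Lim algorithm [5]. Below is a minimal sketch using librosa and soundfile; the STFT parameters, iteration count, and file names are illustrative assumptions, not the settings used for the samples on this page.

    # Illustrative Griffin-Lim reconstruction from a magnitude spectrogram.
    import numpy as np
    import librosa
    import soundfile as sf

    x, fs = librosa.load("LJ001-0001.wav", sr=22050)          # hypothetical input
    S = np.abs(librosa.stft(x, n_fft=1024, hop_length=256))   # magnitude spectrogram
    y = librosa.griffinlim(S, n_iter=60, hop_length=256)      # iterative phase estimation
    sf.write("LJ001-0001_gl.wav", y, fs)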
Database
  22.05 kHz sampling: LJSpeech Database [9]
Training of V2msp and V2msp': 13060 sentences
Evaluation of V2msp and V2msp': 40 sentences

[Audio samples (AnaSyn): Natural, WORLD, GL, WaveNet, WaveGlow, V2msp, V2msp']

To support other research on waveform generation, we also publish our results (tar.gz).
We are grateful to the WaveGlow authors for publishing their audio samples.

Bonus: Applying WaveCycleGAN2 to HTS-Engine

Systems
  HTS: HTS-Engine [10] (Open JTalk front-end with HMM acoustic models)
  NSF: Neural source-filter-based waveform model [11] (audio samples are taken from the official NSF demo page [11]); see the sketch after this list
  V2msp: WaveCycleGAN2 trained under the parallel-data condition (applied to HTS's results)
  V2msp': WaveCycleGAN2 trained under the non-parallel-data condition (applied to HTS's results)
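Neural source-filter models drive learned filtering modules with a sine-based source signal derived from F0 rather than sampling autoregressively. The function below is a minimal sketch of such a sine-plus-noise excitation for a sample-level F0 contour; the amplitude and noise level are illustrative assumptions, and the actual model [11] adds trained neural filter blocks on top of this source.

    # Minimal sine-plus-noise source excitation from a per-sample F0 contour.
    import numpy as np

    def sine_excitation(f0, fs=48000, amp=0.1, noise_std=0.003):
        """f0: per-sample F0 values in Hz (0 where unvoiced)."""
        f0 = np.asarray(f0, dtype=np.float64)
        phase = 2.0 * np.pi * np.cumsum(f0 / fs)       # instantaneous phase
        voiced = (f0 > 0.0).astype(np.float64)         # voiced/unvoiced mask
        noise = np.random.randn(len(f0)) * noise_std   # additive noise component
        return amp * voiced * np.sin(phase) + noise    # sine where voiced, noise elsewhere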
Database
  48 kHz sampling: ATR Japanese speech database [12] (the data distributed with the HTS speaker-dependent training demo, available from the HTS download page [10])
Training of V2msp and V2msp': 450 sentences
Evaluation of V2msp and V2msp': 53 sentences

[Audio samples (TTS): HTS, NSF, V2msp, V2msp']

References

  1. Zhizheng Wu, Oliver Watts, and Simon King, "Merlin: An Open Source Neural Network Speech Synthesis System," in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sep. 2016.
     web page
  2. Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka, "WaveCycleGAN: Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks," in Proc. IEEE Spoken Language Technology (SLT), Dec. 2018.
     web page
  3. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," in IEICE Transactions on Information and Systems, 2016.
     web page
  4. John Kominek and Alan W. Black, "The CMU Arctic Speech Databases," in Proc. 5th ISCA Speech Synthesis Workshop (SSW5), June 2004.
     web page
  5. Daniel W. Griffin and Jae S. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," in IEEE Transactions on ASSP, 1984.
     web page
  6. Ryuichi Yamamoto, "Wavenet Vocoder,"
     web page
  7. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sep. 2016.
     web page
  8. Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis," in Proc. IEEE ICASSP, 2019.
     web page
  9. Keith Ito, "The LJ Speech Dataset,"
     web page
  10. HTS Working Group, "HMM/DNN-based Speech Synthesis System (HTS),"
     web page
  11. Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis," in Proc. IEEE ICASSP, 2019.
     web page
  12. Akira Kurematsu, Kazuya Takeda, Yoshinori Sagisaka, Shigeru Katagiri, Hisao Kuwabara, and Kiyohiro Shikano, "ATR Japanese Speech Database as a Tool of Speech Recognition and Synthesis," in Speech Communication, 1990.
     web page