Paper

WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

 PDF

Japanese audio samples (reported in the paper)

Due to licensing restrictions, we do not have permission to publish the audio samples.
(We plan to train models on an alternative Japanese database whose license allows publication.)

English audio samples

Systems
  Merlin: DNN-based TTS [1]
  V1: Conventional WaveCycleGAN [2] (applied to Merlin's results)
  V2msp: WaveCycleGAN2, i.e., the improved WaveCycleGAN (applied to Merlin's results)
  WORLD: Analysis and synthesis only, via parametric vocoder features [3] (25-dimensional mel-cepstrum, F0, and coded aperiodicity); see the sketch after this list
  Bonus: WaveCycleGAN2 (applied to WORLD's results)
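As an illustration of the WORLD analysis-synthesis entry above, the following is a minimal round-trip sketch. It assumes the pyworld Python bindings of WORLD and the soundfile package, with hypothetical file names and default analysis settings; the 25-dimensional mel-cepstrum and coded-aperiodicity compression used for the samples on this page is omitted.

    # WORLD analysis-synthesis round trip (illustrative sketch only).
    import numpy as np
    import pyworld
    import soundfile as sf

    x, fs = sf.read("arctic_a0001.wav")            # hypothetical 16 kHz input
    x = np.ascontiguousarray(x, dtype=np.float64)  # WORLD expects float64

    f0, t = pyworld.harvest(x, fs)                 # F0 contour and frame times
    sp = pyworld.cheaptrick(x, f0, t, fs)          # spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)                 # band aperiodicity

    y = pyworld.synthesize(f0, sp, ap, fs)         # resynthesized waveform
    sf.write("arctic_a0001_anasyn.wav", y, fs)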
Database
  16 kHz sampling: CMU Arctic Databases [4]
Training: 1000 sentences
Evaluation: 132 sentences

Male speaker: bdl
[Audio samples (TTS / AnaSyn): Natural, Merlin, V1, V2msp, WORLD, Bonus]

Female speaker: slt
[Audio samples (TTS / AnaSyn): Natural, Merlin, V1, V2msp, WORLD, Bonus]

Analysis-and-Synthesis of English audio samples

Systems
  WORLD: Parametric vocoder analysis-synthesis [3] (34-dimensional mel-cepstrum, F0, and coded aperiodicity)
  GL: Phase reconstruction with the Griffin-Lim method [5]; see the sketch after this list
  WaveNet: Open-source WaveNet implementation [6] with a mixture-of-logistics output (not the official WaveNet [7]; audio samples are taken from the WaveGlow authors' public sample folder [8])
  WaveGlow: Official WaveGlow [8] (audio samples are taken from the authors' public sample folder [8])
  V2msp: WaveCycleGAN2 trained under the parallel-data condition (applied to WORLD's results)
  V2msp': WaveCycleGAN2 trained under the non-parallel-data condition (applied to WORLD's results)
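The GL entry is magnitude-only reconstruction with the Griffin-Lim algorithm [5]. Below is a minimal sketch using librosa and soundfile; the STFT parameters, iteration count, and file names are illustrative assumptions, not the settings used for the samples on this page.

    # Illustrative Griffin-Lim reconstruction from a magnitude spectrogram.
    import numpy as np
    import librosa
    import soundfile as sf

    x, fs = librosa.load("LJ001-0001.wav", sr=22050)          # hypothetical input
    S = np.abs(librosa.stft(x, n_fft=1024, hop_length=256))   # magnitude spectrogram
    y = librosa.griffinlim(S, n_iter=60, hop_length=256)      # iterative phase estimation
    sf.write("LJ001-0001_gl.wav", y, fs)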
Database
  22.05 kHz sampling: LJSpeech Database [9]
Training of V2msp and V2msp': 13060 sentences
Evaluation of V2msp and V2msp': 40 sentences

[Audio samples (AnaSyn): Natural, WORLD, GL, WaveNet, WaveGlow, V2msp, V2msp']

To support other research on waveform generation, we also publish our results (tar.gz).
We are grateful to the WaveGlow authors for publishing their audio samples.

Bonus: Applying WaveCycleGAN2 to HTS-Engine

Systems
  HTS: HTS-Engine [10] (Open JTalk front-end with HMM acoustic models)
  NSF: Neural source-filter-based waveform model [11] (audio samples are taken from the official NSF demo page [11]); see the sketch after this list
  V2msp: WaveCycleGAN2 trained under the parallel-data condition (applied to HTS's results)
  V2msp': WaveCycleGAN2 trained under the non-parallel-data condition (applied to HTS's results)
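Neural source-filter models drive learned filtering modules with a sine-based source signal derived from F0 rather than sampling autoregressively. The function below is a minimal sketch of such a sine-plus-noise excitation for a sample-level F0 contour; the amplitude and noise level are illustrative assumptions, and the actual model [11] adds trained neural filter blocks on top of this source.

    # Minimal sine-plus-noise source excitation from a per-sample F0 contour.
    import numpy as np

    def sine_excitation(f0, fs=48000, amp=0.1, noise_std=0.003):
        """f0: per-sample F0 values in Hz (0 where unvoiced)."""
        f0 = np.asarray(f0, dtype=np.float64)
        phase = 2.0 * np.pi * np.cumsum(f0 / fs)       # instantaneous phase
        voiced = (f0 > 0.0).astype(np.float64)         # voiced/unvoiced mask
        noise = np.random.randn(len(f0)) * noise_std   # additive noise component
        return amp * voiced * np.sin(phase) + noise    # sine where voiced, noise elsewhere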
Database
  48 kHz sampling: ATR Japanese speech database [12] (the data distributed with the HTS speaker-dependent training demo, available from the HTS download page [10])
Training of V2msp and V2msp': 450 sentences
Evaluation of V2msp and V2msp': 53 sentences

[Audio samples (TTS): HTS, NSF, V2msp, V2msp']

References

  1. Zhizheng Wu, Oliver Watts, and Simon King, "Merlin: An Open Source Neural Network Speech Synthesis System," in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sep. 2016.
     web page
  2. Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka, "WaveCycleGAN: Synthetic-to-Natural Speech Waveform Conversion Using Cycle-Consistent Adversarial Networks," in Proc. IEEE Spoken Language Technology (SLT), Dec. 2018.
     web page
  3. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," in IEICE Transactions on Information and Systems, 2016.
     web page
  4. John Kominek and Alan W. Black, "The CMU Arctic Speech Databases," in Proc. 5th ISCA Speech Synthesis Workshop (SSW5), June 2004.
     web page
  5. Daniel W. Griffin and Jae S. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," in IEEE Transactions on ASSP, 1984.
     web page
  6. Ryuichi Yamamoto, "Wavenet Vocoder,"
     web page
  7. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sep. 2016.
     web page
  8. Ryan Prenger, Rafael Valle, and Bryan Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis," in Proc. IEEE ICASSP, 2019.
     web page
  9. Keith Ito, "The LJ Speech Dataset,"
     web page
  10. HTS Working Group, "HMM/DNN-based Speech Synthesis System (HTS),"
     web page
  11. Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis," in Proc. IEEE ICASSP, 2019.
     web page
  12. Akira Kurematsu, Kazuya Takeda, Yoshinori Sagisaka, Shigeru Katagiri, Hisao Kuwabara, and Kiyohiro Shikano, "ATR Japanese Speech Database as a Tool of Speech Recognition and Synthesis," in Speech Communication, 1990.
     web page