iSTFTNet2

Check out our related work:

iSTFTNet (ICASSP 2022): Fast and lightweight neural vocoder using iSTFT
MISRNet (Interspeech 2022): Lightweight neural vocoder using multi-input single shared residual blocks
WaveUNetD (ICASSP 2023): Fast and lightweight discriminator using Wave-U-Net
AugCondD (ICASSP 2024): Augmentation-conditional discriminator for limited data

Abstract

The inverse short-time Fourier transform network (iSTFTNet) [1] has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN (e.g., HiFi-GAN [2]) as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via temporal upsampling. However, this strategy compromises the potential to enhance the speed. Therefore, we propose iSTFTNet2, an improved variant of iSTFTNet with a 1D-2D CNN that employs 1D and 2D CNNs to model temporal and spectrogram structures, respectively. We designed a 2D CNN that performs frequency upsampling after conversion in a few-frequency space. This design facilitates the modeling of high-dimensional spectrograms without compromising the speed. The results demonstrated that iSTFTNet2 made iSTFTNet faster and more lightweight with comparable speech quality.

Comparison on LJSpeech
Comparison on VCTK

Results

I. Comparison on LJSpeech

Dataset: LJSpeech [3] (single English speaker)

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5	Sample 6
Ground truth

HiFi-GAN V2 [2]

iSTFTNet-C8C8I4 [1]
iSTFTNet-C8C1I32 [1]

iSTFTNet2-Base
iSTFTNet2-Small

iSTFTNet-MB [1][4]
iSTFTNet2-MB

II. Comparison on VCTK

Dataset: VCTK [5] (multiple English speakers)

	Sample 1 (p227)	Sample 2 (p236)	Sample 3 (p245)	Sample 4 (p272)	Sample 5 (p312)	Sample 6 (p339)
Ground truth

HiFi-GAN V2 [2]

iSTFTNet-C8C8I4 [1]
iSTFTNet-C8C1I32 [1]

iSTFTNet2-Base
iSTFTNet2-Small

Citation

@inproceedings{kaneko2023istftnet2,
  title={{iSTFTNet2}: Faster and More Lightweight {iSTFT}-Based Neural Vocoder Using {1D-2D CNN}},
  author={Takuhiro Kaneko and Hirokazu Kameoka and Kou Tanaka and Shogo Seki},
  booktitle={Interspeech},
  year={2023},
}

References

T. Kaneko, K. Tanaka, H. Kameoka, S. Seki. iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform. ICASSP, 2022.
J. Kong, J. Kim, J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
K. Ito, L. Johnson. The LJ Speech Dataset. 2017.
G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, L. Xie. Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech. SLT, 2021.
J. Yamagishi, C. Veaux, K. MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit. University of Edinburgh. The Centre for Speech Technology Research, 2017.