iSTFTNet

Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

ICASSP 2022
(arXiv:2203.02395, Mar. 2022)

Paper

Figure 1. Comparison of (a) a standard convolutional mel-spectrogram vocoder and (b) iSTFTNet (ours).

News

Check out our follow-up work: MISRNet / Wave-U-Net Discriminator / iSTFTNet2

Abstract

In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly used as an intermediate representation, and the need for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder (Figure 1(a)) solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly computing a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). However, it solves all the problems in a black box and cannot effectively exploit the time-frequency structures present in a mel-spectrogram. We thus propose iSTFTNet (Figure 1(b)), which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, thereby reducing the computational cost of black-box modeling and avoiding redundant estimation of high-dimensional spectrograms. In our experiments, we applied this idea to three HiFi-GAN variants and made the models faster and more lightweight while retaining reasonable speech quality.
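
The idea can be summarized in a few lines of code: the convolutional blocks upsample the features in time while the frequency dimension shrinks, the network predicts a low-dimensional magnitude and phase, and the iSTFT completes the frequency-to-time conversion. Below is a minimal sketch of such an output stage, assuming PyTorch; the module name ISTFTHead, the projection layer, and the default n_fft/hop_length values are illustrative, not the authors' exact implementation.

import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Sketch of an iSTFT output stage: features -> magnitude/phase -> waveform."""

    def __init__(self, channels: int, n_fft: int = 16, hop_length: int = 4):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # One output channel per rFFT bin (n_fft // 2 + 1), for magnitude and phase each.
        self.proj = nn.Conv1d(channels, n_fft + 2, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), produced by the convolutional upsampling blocks.
        mag, phase = self.proj(x).chunk(2, dim=1)
        mag = torch.exp(mag)                # keep the predicted magnitude positive
        spec = mag * torch.exp(1j * phase)  # complex low-dimensional spectrogram
        # The iSTFT supplies the remaining frequency-to-time conversion (x hop_length).
        return torch.istft(
            spec,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=x.device),
        )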

Experimental results

Main results
  I. Synthesis from ground-truth mel-spectrogram
  II. Application to text-to-speech synthesis
Supplementary results
  III. Application to multiple speakers
  IV. Application to Japanese

I. Synthesis from ground-truth mel-spectrogram

  • Dataset: LJSpeech [1]
  • Input: Ground-truth mel-spectrogram
  • Note: The models shown in pink are iSTFTNet. In the model names, Cx denotes a convolutional upsampling block with an upsampling scale of x, and I denotes the iSTFT that supplies the remaining frequency-to-time conversion (see the sketch after the list).
Each model below has five audio samples (Sample 1–Sample 5):
  • Ground truth
  • V1 [2]
  • V1-C8C8C2I
  • V1-C8C8I
  • V1-C8I
  • V1-C8C1I
  • V2 [2]
  • V2-C8C8C2I
  • V2-C8C8I
  • V2-C8I
  • V2-C8C1I
  • V3 [2]
  • V3-C8C8I
  • V3-C8I
  • V3-C8C1I
  • MB-MelGAN [3]
  • PWG [4]
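
As a worked example of the naming scheme, the sketch below derives the iSTFT settings implied by each name: the product of the Cx scales is the temporal upsampling done by convolutions, and the remainder of the total hop length is left to the iSTFT. The 256-sample total hop and the 4:1 FFT-size-to-hop ratio are assumptions (HiFi-GAN's LJSpeech defaults), not values stated on this page.

import re

TOTAL_UPSAMPLING = 256  # assumed mel-spectrogram hop length for LJSpeech

def istft_params(name: str) -> tuple[int, int]:
    """Return (hop_length, n_fft) implied by a model name such as 'V1-C8C8I'."""
    conv_factor = 1
    for scale in re.findall(r"C(\d+)", name):
        conv_factor *= int(scale)  # temporal upsampling done by the conv blocks
    hop_length = TOTAL_UPSAMPLING // conv_factor  # factor left for the iSTFT
    return hop_length, 4 * hop_length  # assumption: FFT size = 4 x hop length

for name in ["V1-C8C8C2I", "V1-C8C8I", "V1-C8I"]:
    hop_length, n_fft = istft_params(name)
    print(f"{name}: iSTFT hop_length={hop_length}, n_fft={n_fft}")
# e.g., V1-C8C8I -> hop_length=4, n_fft=16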

II. Application to text-to-speech synthesis

  • Dataset: LJSpeech [1]
  • Input: Text
  • Note: The models shown in pink are iSTFTNet.
Each model below has five audio samples (Sample 1–Sample 5).
Text: "made certain recommendations which it believes would, if adopted, materially improve upon the procedures in effect at the time of President Kennedy's assassination and result in a substantial lessening of the danger. As has been pointed out, the Commission has not resolved all the proposals which could be made. The Commission nevertheless is confident that, with the active cooperation of the responsible agencies and with the understanding of the people of the United States in their demands upon their President, the recommendations we have here suggested would greatly advance the security of the office without any impairment of our fundamental liberties."
  • Ground truth
  • C-FS2 + V1 [2]
  • C-FS2 + V1-C8C8I
  • C-FS2 [5]

III. Application to multiple speakers

  • Dataset: VCTK [6]
  • Input: Ground-truth mel-spectrogram
  • Note: The models shown in pink are iSTFTNet.
Each model below has five audio samples: Sample 1 (speaker p240), Sample 2 (p260), Sample 3 (p280), Sample 4 (p311), Sample 5 (p335).
  • Ground truth
  • V1 [2]
  • V1-C5C5I
  • V2 [2]
  • V2-C5C5I
  • V3 [2]
  • V3-C10C6I

IV. Application to Japanese

  • Dataset: JSUT [7]
  • Input: Ground-truth mel-spectrogram
  • Note: The models shown in pink are iSTFTNet.
Each model below has five audio samples (Sample 1–Sample 5):
  • Ground truth
  • V1 [2]
  • V1-C5C5I
  • V2 [2]
  • V2-C5C5I
  • V3 [2]
  • V3-C10C6I

Citation

@inproceedings{kaneko2022istftnet,
  title={{iSTFTNet}: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time {Fourier} Transform},
  author={Takuhiro Kaneko and Kou Tanaka and Hirokazu Kameoka and Shogo Seki},
  booktitle={ICASSP},
  year={2022},
}

References

  1. K. Ito, L. Johnson. The LJ Speech Dataset, 2017.
  2. J. Kong, J. Kim, J. Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
  3. G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, L. Xie. Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech. SLT, 2021.
  4. R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. ICASSP, 2020.
  5. P. Guo, F. Boyer, X. Chang, T. Hayashi, Y. Higuchi, H. Inaguma, N. Kamo, C. Li, D. Garcia-Romero, J. Shi, J. Shi, S. Watanabe, K. Wei, W. Zhang, Y. Zhang. Recent Developments on ESPnet Toolkit Boosted by Conformer. ICASSP, 2021.
  6. C. Veaux, J. Yamagishi, K. MacDonald. CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit. The Centre for Speech Technology Research, University of Edinburgh, 2017.
  7. R. Sonobe, S. Takamichi, H. Saruwatari. JSUT Corpus: Free Large-Scale Japanese Speech Corpus for End-to-End Speech Synthesis. arXiv preprint arXiv:1711.00354, 2017.

Our follow-up work

  1. T. Kaneko, H. Kameoka, K. Tanaka, S. Seki. MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks. Interspeech, 2022. Project
  2. T. Kaneko, H. Kameoka, K. Tanaka, S. Seki. Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis. ICASSP, 2023. Project
  3. T. Kaneko, H. Kameoka, K. Tanaka, S. Seki. iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN. Interspeech, 2023. Project