Sound demo for multi-channel joint denoising and dereverberation

This page contains speech signals sampled from the experiments presented in the following paper.

T. Nakatani, N. Kamo, M. Delcroix, S. Araki, "Multi-stream diffusion model for probabilistic integration of model-based and data-driven speech enhancement," in Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 65-69, September 2024.

The paper proposes a multi-stream diffusion model that integrates a diffusion model-based speech enhancement method, mSGMSE [1,2], with a blind dereverberation technique, Weighted Prediction Error (WPE) [3,4]. For further details, please refer to the paper.
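For readers unfamiliar with WPE, the following is a minimal sketch of the iterative delayed linear prediction described in [3], applied to a single frequency bin of a multi-channel STFT. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `wpe_one_bin`, the default values of `taps`, `delay`, and `iterations`, and the small diagonal loading are all hypothetical choices made for this example.

```python
import numpy as np

def wpe_one_bin(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    """Dereverberate one frequency bin with iterative delayed linear prediction.

    Y: complex STFT coefficients of shape (channels, frames).
    Returns an array of the same shape.
    All defaults are illustrative assumptions, not values from the paper.
    """
    Y = np.asarray(Y, dtype=complex)
    D, T = Y.shape

    # Stack delayed copies of the observation: shape (D * taps, T).
    # Row block k holds the observation delayed by (delay + k) frames.
    Y_tilde = np.zeros((D * taps, T), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]

    X = Y.copy()  # initial estimate of the dereverberated signal
    for _ in range(iterations):
        # Time-varying variance of the current estimate, averaged over channels.
        lam = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)
        # Variance-normalized correlation matrix and correlation vectors.
        Yw = Y_tilde / lam
        R = Yw @ Y_tilde.conj().T   # (D * taps, D * taps)
        P = Yw @ Y.conj().T         # (D * taps, D)
        # Small diagonal loading for numerical stability (an assumption).
        R += 1e-8 * (np.trace(R).real / (D * taps) + eps) * np.eye(D * taps)
        # Prediction filter, then subtract the predicted late reverberation.
        G = np.linalg.solve(R, P)
        X = Y - G.conj().T @ Y_tilde
    return X
```

In practice the function would be applied independently to every frequency bin of the STFT. The prediction delay keeps the filter from cancelling the direct sound and early reflections, which is why WPE reduces reverberation but leaves additive noise largely untouched.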

Speech signals for comparison (please use headphones for optimal listening)

Female and male speech signals, each a 2-channel real recording, were taken from the REVERB Challenge dataset and processed using three different methods. For each speaker, the following four signals are provided:

Recorded speech (original reverberant recording)
Speech enhanced by WPE
Speech enhanced by a diffusion model-based method with ensemble inference [5]
Speech enhanced by the proposed method, which integrates WPE and the diffusion model (WPE+Diffusion model) with ensemble inference

Method	What to listen for	Female speech	Male speech
Recorded speech	Noisy and reverberant speech
WPE	Dereverberated speech (noise is not reduced)
Diffusion model	Denoised and dereverberated speech
WPE+Diffusion model (proposed method)	More accurately denoised and dereverberated speech

Other related work

[1]	Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, and Timo Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 31, pp. 2351-2364, 2023.
[2]	Rino Kimura, Tomohiro Nakatani, Naoyuki Kamo, Marc Delcroix, Shoko Araki, Tetsuya Ueda, and Shoji Makino, "Diffusion model-based MIMO speech denoising and dereverberation," in Proc. Hands-free Speech Communication and Microphone Arrays (HSCMA), 2024.
[3]	Tomohiro Nakatani, Takuya Yoshioka, Keisuke Kinoshita, and Masato Miyoshi, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717-1731, August 2010.
[4]	Takuya Yoshioka and Tomohiro Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707-2720, 2012.
[5]	Naoyuki Kamo, Marc Delcroix, and Tomohiro Nakatani, "Target speech extraction with conditional diffusion model," in Proc. Interspeech, pp. 176-180, 2023.