| Signal | What to listen for | Female speech | Male speech |
|---|---|---|---|
| Recorded speech | Noisy and reverberant speech | | |
| WPE | Dereverberated speech (noise is not reduced) | | |
| Diffusion model | Denoised and dereverberated speech | | |
| WPE + diffusion model (proposed method) | More accurately denoised and dereverberated speech | | |
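The WPE row shows that linear-prediction dereverberation leaves noise in place. The sketch below illustrates why. It is a minimal NumPy version of iterative multichannel WPE for a single STFT frequency bin, in the spirit of the variance-normalized delayed linear prediction of [3] and its multichannel generalization [4]. It is not the implementation used to produce these samples. Each frame's late reverberation is predicted from frames at least `delay` frames in the past and then subtracted. Anything that cannot be predicted from the past, such as temporally uncorrelated noise, largely passes through. The function names, the defaults `taps=10`, `delay=3`, and `iterations=3`, and the variance floor `eps` are illustrative assumptions.

```python
import numpy as np


def stack_delayed_frames(Y, taps, delay):
    """Stack delayed copies of one STFT bin with shape (channels, frames).

    Row block k holds Y shifted right by (delay + k) frames, so column t
    contains [y_{t-delay}, ..., y_{t-delay-taps+1}], zero-padded before t = 0.
    """
    D, T = Y.shape
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        if shift < T:  # skip shifts that leave no overlapping frames
            Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    return Y_tilde


def wpe_one_bin(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    """Iterative multichannel WPE for one frequency bin (illustrative sketch).

    Y: complex array of shape (channels, frames).
    Returns X with the same shape: Y minus the late reverberation that is
    linearly predicted from frames at least `delay` frames in the past.
    """
    Y = np.asarray(Y, dtype=complex)
    Y_tilde = stack_delayed_frames(Y, taps, delay)
    X = Y.copy()
    for _ in range(iterations):
        # Time-varying variance of the desired signal, averaged over channels.
        lam = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)
        weighted = Y_tilde / lam          # divide every frame t by lambda_t
        R = weighted @ Y_tilde.conj().T   # sum_t y~_t y~_t^H / lambda_t
        P = weighted @ Y.conj().T         # sum_t y~_t y_t^H / lambda_t
        G = np.linalg.solve(R, P)         # prediction filters, (D*taps, D)
        X = Y - G.conj().T @ Y_tilde      # subtract predicted late reverberation
    return X
```

In a complete system this would run independently for every frequency bin of a multichannel STFT. The proposed method additionally combines WPE with a diffusion model for denoising, which this sketch does not cover.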
| [1] | Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, and Timo Gerkmann, "Speech Enhancement and Dereverberation with Diffusion-Based Generative Models," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023. |
| [2] | Rino Kimura, Tomohiro Nakatani, Naoyuki Kamo, Marc Delcroix, Shoko Araki, Tetsuya Ueda, and Shoji Makino, "Diffusion model-based MIMO speech denoising and dereverberation," in Proc. Hands-free Speech Communication and Microphone Arrays (HSCMA), 2024. |
| [3] | Tomohiro Nakatani, Takuya Yoshioka, Keisuke Kinoshita, and Masato Miyoshi, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, Aug. 2010. |
| [4] | Takuya Yoshioka and Tomohiro Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012. |
| [5] | Naoyuki Kamo, Marc Delcroix, and Tomohiro Nakatani, "Target speech extraction with conditional diffusion model," in Proc. Interspeech, pp. 176–180, 2023. |