Joint Denoising and Dereverberation Sound Demo (1ch)

The sound demos on this page were sampled from experiments detailed in the following paper:

T. Nakatani, N. Kamo, M. Delcroix, S. Araki, "A Hybrid Probabilistic-Deterministic Model Recursively Enhancing Speech," accepted for IEEE ICASSP 2025.

This paper introduces a novel single-channel speech enhancement (SE) method based on neural networks (NN), named Probabilistic-Deterministic Recursive Enhancement (PDRE). For more details, please refer to the paper.

You can also find video demos showcasing streaming speech enhancement and streaming speech dereverberation with a 2-second processing delay.

Sound demos on this page:

Use two datasets (simulated and real).
Compare PDRE (proposed) with conventional deterministic NN-based and diffusion model-based SE methods.

The two datasets:

Sound demo1: Real recordings (mismatched condition), extracted from the REVERB challenge dataset
Sound demo2: Simulated noisy and reverberant speech (matched condition), generated by mixing speech data from WSJ0 with noise data from CHiME3, each reverberated using room impulse responses synthesized by the image method

Compared methods, along with their quality scores and Real Time Factors (RTFs):

Methods	Description	SI-SDR (dB)	fwsSNR (dB)	ESTOI	PESQ	RTF
Observation	No SE applied	-3.6	4.6	0.47	2.32
detNN	A deterministic NN-based SE, trained to map distorted speech to clean speech	6.3	11.4	0.83	2.32	0.021
mSGMSE+Ensemble	A diffusion model-based SE: a multi-stream extension of the Score-based Generative Model for SE, integrated with detNN and using ensemble inference	8.1	11.7	0.86	2.58	7.87
PDRE (proposed)	SE based on Probabilistic-Deterministic Recursive Enhancement	8.4	12.7	0.87	2.56	0.077
Clean	Clean speech reference containing direct signal and early reflections within 2 ms after the direct signal
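
For readers less familiar with two of the columns above, the following is a minimal sketch of how SI-SDR and RTF are commonly defined. It assumes the standard textbook definitions (SI-SDR as the energy ratio between the scaled reference and the residual, RTF as processing time divided by audio duration) and is not the evaluation code used for the table. The function names `si_sdr`, `real_time_factor`, and `measure_rtf` are hypothetical.

```python
import math
import time


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    # Everything in the estimate that is not the scaled reference is residual.
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(target_energy / residual_energy)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the processed audio."""
    return processing_seconds / audio_seconds


def measure_rtf(process, samples, sample_rate):
    """Time one call of `process` (any callable taking samples) and return its RTF."""
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

An RTF below 1 means the method runs faster than real time on the machine used for the measurement. Because SI-SDR is scale-invariant, multiplying the enhanced signal by a constant gain leaves the score unchanged.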

Advantages of PDRE (proposed):

Compared with detNN:
Significantly improves signal quality with a relatively small increase in RTF.
Compared with mSGMSE+Ensemble:
Achieves comparable (or slightly better) signal quality with a significantly reduced RTF (less than 1/100).
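
The trade-offs stated above can be checked directly against the table. The short sketch below copies the SI-SDR and RTF values from the table and computes the SI-SDR gain and the RTF ratio of PDRE relative to each baseline. The `scores` dictionary and the `compare` helper are illustrative names, not part of the paper.

```python
# SI-SDR (dB) and RTF values copied from the table on this page.
scores = {
    "detNN": {"si_sdr_db": 6.3, "rtf": 0.021},
    "mSGMSE+Ensemble": {"si_sdr_db": 8.1, "rtf": 7.87},
    "PDRE": {"si_sdr_db": 8.4, "rtf": 0.077},
}


def compare(proposed, baseline):
    """Return (SI-SDR gain in dB, RTF ratio) of `proposed` relative to `baseline`."""
    p, b = scores[proposed], scores[baseline]
    return p["si_sdr_db"] - b["si_sdr_db"], p["rtf"] / b["rtf"]


for baseline in ("detNN", "mSGMSE+Ensemble"):
    gain_db, rtf_ratio = compare("PDRE", baseline)
    print(f"PDRE vs {baseline}: SI-SDR {gain_db:+.1f} dB, RTF ratio {rtf_ratio:.4f}")
```

With these numbers, PDRE gains about 2.1 dB SI-SDR over detNN at roughly 3.7 times its RTF, and it matches or slightly exceeds mSGMSE+Ensemble in SI-SDR at just under 1/100 of its RTF.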