| 08 |
Listening to what you want! Real-time selective listening of everyday sounds |
|---|
Humans can selectively listen to a target sound even when many sounds overlap. This research brings that capability to computers by developing real-time target sound extraction that isolates desired audio from mixed signals on general-purpose PCs while maintaining high accuracy. By incorporating an audio foundation model with general sound representations developed at NTT, the method further improves extraction accuracy and sound quality. We also implement binaural processing to estimate the direction of arrival, making the system closer to human listening. Ultimately, the technology lets users flexibly hear or suppress sounds depending on the context, for example, by reducing household noise in remote-work meetings while preserving meaningful sounds during family calls, enabling more comfortable and effective communication.
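As a rough illustration of how two-channel (binaural) audio carries directional information, the sketch below estimates the interaural time difference (ITD) from the peak of the cross-correlation between the two ear signals and maps it to an azimuth with a simple far-field model. This is not the method used in this research, which relies on a neural extraction model. The function names `estimate_itd` and `itd_to_azimuth`, the ear spacing, and the speed of sound are assumptions chosen for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C
EAR_DISTANCE = 0.18     # m, assumed spacing between the two ear microphones


def estimate_itd(left, right, sample_rate):
    """Delay of `right` relative to `left`, in seconds, from the cross-correlation peak.

    A positive value means the sound reached the left ear first.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # corr[i] corresponds to lag k = i - (len(left) - 1), with
    # corr at lag k = sum_n right[n + k] * left[n].
    corr = np.correlate(right, left, mode="full")
    lag = int(np.argmax(corr)) - (len(left) - 1)
    return lag / sample_rate


def itd_to_azimuth(itd, ear_distance=EAR_DISTANCE):
    """Azimuth in degrees from the far-field model ITD = d * sin(theta) / c.

    Positive angles point toward the left ear.
    """
    sin_theta = np.clip(SPEED_OF_SOUND * itd / ear_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


# Toy check: white noise that reaches the right ear 10 samples late.
rng = np.random.default_rng(0)
sample_rate = 48_000
left = rng.standard_normal(4800)
right = np.concatenate([np.zeros(10), left[:-10]])

itd = estimate_itd(left, right, sample_rate)
print(f"ITD: {itd * 1e3:.3f} ms, azimuth: {itd_to_azimuth(itd):.1f} degrees")
```

A cross-correlation peak only works well for a single dominant source. With overlapping sounds, which is the setting this research addresses, the target would first need to be separated from the mixture.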
[1] M. Delcroix, J. B. Vázquez, T. Ochiai, K. Kinoshita, Y. Ohishi, S. Araki, “SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 31, pp. 121-136, 2022.
[2] K. Wakayama, T. Ochiai, M. Delcroix, M. Yasuda, S. Saito, S. Araki, A. Nakayama, “Online target sound extraction with knowledge distillation from partially non-causal teacher,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 561-565, 2024.
[3] C. Hernandez-Olivan, M. Delcroix, T. Ochiai, D. Niizumi, N. Tawara, T. Nakatani, S. Araki, “SoundBeam meets M2D: Target sound extraction with audio foundation model,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[4] C. Hernandez-Olivan, M. Delcroix, T. Ochiai, N. Tawara, T. Nakatani, S. Araki, “Interaural time difference loss for binaural target sound extraction,” in Proc. 18th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 210-214, 2024.
Marc Delcroix, Media Information Laboratory, Signal Processing Research Group