Research Talk

Developing AI that pays attention to who you want to listen to
Deep learning based selective hearing with SpeakerBeam
Marc Delcroix
Signal Processing Research Group, Media Information Laboratory

Abstract

In a noisy environment such as a cocktail party, humans can focus on listening to a desired speaker, an ability known as selective hearing. In this talk, we discuss approaches to realize computational selective hearing. We first introduce SpeakerBeam, a deep learning-based method we proposed to extract the speech of a desired target speaker from a mixture of several speakers by exploiting a few seconds of pre-recorded audio of the target speaker. We then present recent research that includes (1) an extension to multi-modal processing, where we exploit video of the target speaker's lip movements in addition to the audio pre-recording, (2) integration with automatic speech recognition, and (3) generalization to the extraction of arbitrary sounds.
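
To make the idea of enrollment-conditioned extraction concrete, the following minimal PyTorch sketch shows the general pattern described above: a network estimates a time-frequency mask for the mixture while being conditioned on an embedding computed from a short pre-recording of the target speaker. This is an illustrative example only, not the authors' SpeakerBeam implementation; the layer sizes, module names, and the multiplicative conditioning scheme are assumptions chosen for clarity.

```python
# Illustrative sketch (not the actual SpeakerBeam code): mask-based target
# speaker extraction conditioned on an enrollment-derived speaker embedding.
import torch
import torch.nn as nn


class EnrollmentEncoder(nn.Module):
    """Summarizes a few seconds of the target speaker's pre-recorded audio
    (here represented as a magnitude spectrogram) into a fixed-size embedding."""

    def __init__(self, n_freq=257, emb_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, emb_dim, batch_first=True)

    def forward(self, enroll_spec):            # (batch, time, n_freq)
        out, _ = self.rnn(enroll_spec)
        return out.mean(dim=1)                 # time-average -> (batch, emb_dim)


class TargetSpeakerExtractor(nn.Module):
    """Estimates a time-frequency mask for the target speaker; the speaker
    embedding multiplicatively scales hidden activations (an assumed
    conditioning scheme), steering the network toward the desired voice."""

    def __init__(self, n_freq=257, hidden=128, emb_dim=128):
        super().__init__()
        self.mix_rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.spk_proj = nn.Linear(emb_dim, hidden)
        self.out_rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, spk_emb):  # (B, T, F), (B, emb_dim)
        h, _ = self.mix_rnn(mixture_spec)
        h = h * self.spk_proj(spk_emb).unsqueeze(1)  # speaker-dependent scaling
        h, _ = self.out_rnn(h)
        mask = self.mask(h)                          # values in [0, 1]
        return mask * mixture_spec                   # masked target estimate


if __name__ == "__main__":
    enc, ext = EnrollmentEncoder(), TargetSpeakerExtractor()
    enroll = torch.randn(1, 300, 257)   # ~3 s of enrollment features
    mixture = torch.randn(1, 500, 257)  # mixture spectrogram to process
    target = ext(mixture, enc(enroll))
    print(target.shape)                 # torch.Size([1, 500, 257])
```

The same conditioning idea extends to the multi-modal case by encoding lip-movement video into an additional clue vector and combining it with, or substituting it for, the audio enrollment embedding.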


Speaker
Marc Delcroix
Signal Processing Research Group, Media Information Laboratory