Exhibition Program

Science of Media Information


Pay attention to the speaker you want to listen to (Ⅱ)

Neural selective hearing with audio-visual speaker clues


Humans can concentrate on listening to a desired speaker even when multiple people are speaking at the same time, an ability known as selective hearing. The purpose of this research is to realize this human selective hearing mechanism on a computer. We propose a multimodal selective hearing technology that uses video information, in addition to audio information, as clues to the target speaker. By drawing on multiple information sources, as humans do, the technology can operate stably even in situations where audio clues are of little use, such as conversations between speakers with similar voice characteristics. This technology will become a foundation for various devices that take human voice as input. For example, it will contribute to the realization of robots and smart speakers that recognize people and change their responses accordingly.


  1. T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, T. Nakatani, “Multimodal SpeakerBeam: Single channel target speech extraction with audio-visual speaker clues,” in Proc. Interspeech, 2019.
  2. K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, J. Cernocky, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, 2019.



Tsubasa Ochiai / Signal Processing Research Group, Media Information Laboratory