Exhibition Program

Science of Media Information

17

Pay attention to the speaker you want to listen to (Ⅱ)

Neural selective hearing with audio-visual speaker clues

Abstract

Humans can concentrate on listening to a desired speaker even when several people are speaking at the same time, an ability known as selective hearing. The purpose of this research is to realize this human selective hearing mechanism on a computer. We propose a multimodal selective hearing technology that uses video information, in addition to audio information, as clues about the target speaker. By drawing on multiple information sources as humans do, the technology can operate stably even in situations where audio clues are of little use, such as conversations between speakers with similar voice characteristics. This technology will become a foundation for various devices that take the human voice as input. For example, it will contribute to the realization of robots and smart speakers that recognize people and change their responses accordingly.
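
To make the idea of combining audio and video clues concrete, the following is a minimal sketch and not the authors' implementation. It assumes each clue has already been encoded as a fixed-size embedding, weights the two embeddings with a simple attention score, and scales the features of the mixed speech by the fused clue. All names (fuse_clues, extract_target, query) and the scaled dot-product scoring are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_clues(audio_clue, visual_clue, query):
    """Attention-weighted sum of two clue embeddings (illustrative only).

    audio_clue, visual_clue, query: 1-D arrays of the same length D.
    Returns the fused clue (D,) and the two attention weights (2,).
    """
    clues = np.stack([audio_clue, visual_clue])        # shape (2, D)
    scores = clues @ query / np.sqrt(query.shape[0])   # shape (2,)
    weights = softmax(scores)                          # sums to 1
    return weights @ clues, weights

def extract_target(mixture_features, fused_clue):
    """Scale every frame of the mixture features by the fused clue.

    mixture_features: (T, D) array, fused_clue: (D,) array.
    """
    return mixture_features * fused_clue

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixture = rng.standard_normal((5, 4))   # 5 frames, 4 feature dimensions
    audio = np.zeros(4)                     # an uninformative audio clue
    visual = np.ones(4)                     # an informative visual clue
    fused, weights = fuse_clues(audio, visual, query=np.ones(4))
    print("attention weights (audio, visual):", weights)
    print("output shape:", extract_target(mixture, fused).shape)
```

In this toy setting the uninformative audio clue receives the smaller weight, which mirrors the case described above in which speakers have similar voices and the video clue has to carry the selection.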

References

  1. T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, T. Nakatani, “Multimodal SpeakerBeam: Single channel target speech extraction with audio-visual speaker clues,” in Proc. Interspeech, 2019.
  2. K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, J. Cernocky, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, 2019.

Contact

Tsubasa Ochiai / Signal Processing Research Group, Media Information Laboratory