Human beings have the ability to concentrate on listening to a desired speaker (= selective hearing) even when multiple people are speaking at the same time. The purpose of this research is to realize the selective listening mechanism of human beings on a computer. In this research, we propose multimodal selective hearing technology that uses video information as the target speaker’s clues in addition to audio information. By utilizing multiple information sources like humans, the technology become advanced that can operate stably even in situations, where audio clues are useless, such as conversations between speakers with similar voice characteristics. This technology will become fundamentals of various devices that take human voice as input. For example, it will con-tributes to the realization of robots and smart speakers that recognize people and change their response.
/ Signal Processing Research Group, Media Information Laboratory