Demo Video on Real-time Meeting Recognizer and Browser (267 MB).
Demo Video on Meeting Assistance system on a tablet PC (100 MB).
Demo Video on Multi-modal Communication Scene Analysis (93 MB).
We are sorry that all the conversations in the videos are in Japanese, but the captions and narration are in English, and we hope you can still get an idea of our system.
Fig. 2: Prototype of the meeting browser (click for more details).
Our system analyzes both the activity of each participant (e.g., when and what they speak, when they laugh, and whom they are watching) and the atmosphere of the meeting (e.g., topic, activeness, casualness) by integrating audio and visual information.
All the analysis results are continuously displayed in a browser (Fig. 2):
Left: Live streaming video of the 360-degree view provided by the camera, together with the analysis results for "who is speaking to whom" and "who is looking at whom".
Upper-right: Status of each participant, with a real-time transcript, the current state (detected event: speaking, laughing, or silent), the number of spoken words, and the visual focus of attention.
Lower-right: Meeting atmosphere, including the activeness, casualness, and topic words of the meeting.
Fig. 3: Meeting assistance with a smartphone (click for larger image).
We also implemented a meeting assistance system as an application of our real-time meeting recognizer.
The system visualizes the meeting recognition results on tablet PCs and smartphones (Fig. 3) in an online manner.
The system also makes it easy to retrieve and access recorded meeting information.
Online-mode usage examples:
Assisting and activating a discussion.
Searching for unknown terms in the discussion with a smartphone.
Highly realistic TV conferencing by transmitting multi-modal information.
Offline-mode usage examples:
Taking meeting minutes
Archiving meetings
Reviewing and training communication skills
Demo Video on Meeting Assistance system on a tablet PC (100 MB). (The tablet PC version appears after 0:48. The conversations are in Japanese, but we hope you can get an idea of this system.)
Fig. 4: System architecture (click for larger image).
Figure 4 shows the system architecture. Part (A): Speech enhancement:
In meeting recognition scenarios, we need to remove several types of audio distortion, e.g., additive noise, speech overlaps, and reverberation.
To combat all of these types of distortion, we employ four techniques [4] (a toy sketch of the noise-suppression step follows the list):
Dereverberation based on multi-channel weighted linear prediction [5]
Speaker diarization (estimation of "who speaks when") based on direction-of-arrival estimation [6]
Source separation with a beamforming approach [6]
Noise suppression with a mel-scaled Wiener filter [7]
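As a rough illustration of the noise-suppression step, here is a minimal sketch of a spectral Wiener filter, assuming the first few frames of the signal are speech-free so they can serve as a noise estimate. The actual system uses a mel-scaled Wiener filter driven by statistical models [7]; all function names and parameter values below are our own illustrative choices.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def wiener_suppress(x, noise_frames=10, floor=0.1):
    """Toy Wiener-filter noise suppression.

    The noise power spectrum is estimated from the first `noise_frames`
    frames (assumed to contain no speech); each time-frequency bin is
    then scaled by the Wiener gain G = max(SNR / (1 + SNR), floor).
    """
    X = stft(x)
    noise_psd = np.mean(np.abs(X[:noise_frames]) ** 2, axis=0)
    snr = np.maximum(np.abs(X) ** 2 / (noise_psd + 1e-10) - 1.0, 0.0)
    gain = np.maximum(snr / (1.0 + snr), floor)
    return X * gain  # enhanced STFT; invert with an overlap-add iSTFT
```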
Part (B): Speech recognition provides the transcribed utterances from each speaker's enhanced signal.
We employ the speech recognizer SOLON (Speech recognizer with OutLook On the Next generation) [8], developed at our laboratories, which achieves fast, high-accuracy speech recognition using discriminatively trained acoustic and language models [9][10]. SOLON adopts an efficient WFST-based decoding algorithm [11] with a fast on-the-fly composition technique that makes it possible to integrate different language models in low-latency one-pass decoding [12]; a toy sketch of this composition idea appears below.
SOLON is used both for speech recognition and for acoustic event detection (silence, speech, and laughter).
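The on-the-fly composition of [11,12] avoids expanding the full composed search network in advance. The toy sketch below illustrates the idea on two small weighted acceptors in the tropical semiring: composed state pairs are created only when the search reaches them. This is not SOLON's actual decoder; the automata and scores are invented for illustration.

```python
import heapq

# Toy weighted acceptors in the tropical semiring (weights add along a
# path, paths are compared by min). Arcs: state -> [(label, weight, next)].
A = {0: [("a", 1.0, 1), ("b", 2.0, 1)], 1: []}   # e.g., lower-level scores
B = {0: [("a", 0.5, 1), ("b", 0.1, 1)], 1: []}   # e.g., language-model scores
FINAL_A, FINAL_B = {1}, {1}

def lazy_compose_shortest(A, B, final_a, final_b):
    """Dijkstra over the composed machine, expanding pair states on demand.

    The composed state space {(qa, qb)} is never built in full: the arcs
    of (qa, qb) are enumerated only when the search pops that pair, which
    is the essence of on-the-fly composition in one-pass decoding.
    """
    heap = [(0.0, (0, 0), [])]            # (cost, pair state, label path)
    seen = set()
    while heap:
        cost, (qa, qb), path = heapq.heappop(heap)
        if (qa, qb) in seen:
            continue
        seen.add((qa, qb))
        if qa in final_a and qb in final_b:
            return cost, path             # best path through A composed with B
        for la, wa, na in A.get(qa, []):
            for lb, wb, nb in B.get(qb, []):
                if la == lb:              # labels must match to compose
                    heapq.heappush(heap, (cost + wa + wb, (na, nb), path + [la]))
    return None

print(lazy_compose_shortest(A, B, FINAL_A, FINAL_B))  # -> (1.5, ['a'])
```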
For browsing ongoing meetings, we apply our proposed sentence boundary detection [13] and topic tracking [14] techniques to obtain the analysis results from the speech recognition output with low latency.
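As a simplified stand-in for the topic tracking model of [14], the sketch below scores words from a sliding window of recent recognized utterances with a TF-IDF-style weight, promoting words that are frequent recently but rare over the whole meeting. The real system adapts the language model itself; the window size, scoring, and stop-word handling here are illustrative assumptions.

```python
import math
from collections import Counter

def topic_words(utterances, window=20, top_k=5, stop_words=frozenset()):
    """Toy topic-word extraction over a sliding window of utterances.

    Words frequent in the recent window but rare across the whole meeting
    so far get high scores, a crude stand-in for a topic tracking
    language model.
    """
    docs = [set(u.split()) for u in utterances]
    n_docs = len(docs)
    recent = Counter(w for u in utterances[-window:]
                     for w in u.split() if w not in stop_words)
    scores = {
        w: tf * math.log(n_docs / (1 + sum(w in d for d in docs)))
        for w, tf in recent.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(topic_words([
    "the budget review starts today",
    "the budget needs another review",
    "lunch was good today",
], window=2, top_k=3, stop_words={"the", "was"}))
```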
Detailed information can be found in [1,2] and their references.
Part (C): Visual processing:
The face pose tracker [15,16] provides head pose information for each participant, from which we estimate "who is looking at whom" and the visual focus of attention. Our system also estimates "who is speaking to whom" by combining the audio and visual information.
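A minimal sketch of how head pose can be mapped to "who is looking at whom": compare each participant's head yaw with the bearing of every other seat around the omni-directional camera, pick the closest one within a tolerance, and gate the result with the diarization output to obtain "who is speaking to whom". The seat bearings and tolerance below are illustrative assumptions, not values from [15,16].

```python
import math

# Illustrative seat bearings (degrees) around the omni-directional camera;
# in the real system, participant positions come from the tracker [15].
SEATS = {"A": 0.0, "B": 90.0, "C": 180.0, "D": 270.0}

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def focus_of_attention(person, head_yaw, tolerance=30.0):
    """Return whom `person` is looking at, or None (e.g., table/notes).

    Head yaw is assumed to be in the same global frame as SEATS.
    """
    candidates = {
        other: angle_diff(head_yaw, bearing)
        for other, bearing in SEATS.items() if other != person
    }
    target, diff = min(candidates.items(), key=lambda kv: kv[1])
    return target if diff <= tolerance else None

def speaking_to(person, head_yaw, is_speaking):
    """Combine diarization ("who speaks") with visual focus of attention
    to estimate "who is speaking to whom"."""
    return focus_of_attention(person, head_yaw) if is_speaking else None

print(speaking_to("A", head_yaw=85.0, is_speaking=True))  # -> 'B'
```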
[1] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Nakamura, and J. Yamato, "Low-latency Real-time Meeting Recognition and Understanding Using Distant Microphones and Omni-directional Camera," IEEE Transactions on Audio, Speech, and Language Processing (accepted).
[2] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Nakamura, and J. Yamato, "Real-time Meeting Recognition and Understanding Using Distant Microphones and Omni-directional Camera," in Proc. SLT 2010, 2010.
[3] S. Araki, T. Hori, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, M. Delcroix, K. Kinoshita, T. Nakatani, A. Nakamura, and J. Yamato, "Demonstration on low-latency meeting recognition and understanding using distant microphones," in Proc. HSCMA 2011, 2011.
Speech Enhancement & Diarization
[4] S. Araki, T. Hori, M. Fujimoto, S. Watanabe, T. Yoshioka, and T. Nakatani, "Online meeting recognizer with multichannel speaker diarization," in Proc. Asilomar 2010, pp. 1697--1701, 2010.
[5] T. Yoshioka, T. Nakatani, T. Miyoshi, and H. G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 69--84, 2011.
[6] S. Araki, M. Fujimoto, K. Ishizuka, H. Sawada, and S. Makino, "Speaker indexing and speech enhancement in real meetings / conversations," in Proc. ICASSP 2008, pp. 93--96, 2008.
[7] M. Fujimoto, K. Ishizuka, and T. Nakatani, "A study of mutual front-end processing method based on statistical model for noise robust speech recognition," in Proc. Interspeech 2009, pp. 1235--1238, September 2009.
Speech recognition
[8] T. Hori, "NTT Speech recognizer with OutLook On the Next generation: SOLON," in Proc. NTT Workshop on Communication Scene Analysis, 2004, pp. SP.6.
[9] E. McDermott, S. Watanabe, and A. Nakamura, "Discriminative training based on an integrated view of MPE and MMI in margin and error space," in Proc. ICASSP 2010, pp. 4894--4897, 2010.
[10] T. Oba, T. Hori, A. Nakamura, and A. Ito, "Round-robin duel discriminative language models," IEEE Transactions on Audio, Speech, and Language Processing, 2011 (to appear).
[11] T. Hori, C. Hori, Y. Minami, and A. Nakamura, "Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1352--1365, 2007.
[12] T. Oba, T. Hori, A. Nakamura, and A. Ito, "Round-robin duel discriminative language models in one-pass decoding with on-the-fly error correction," in Proc. ICASSP 2011, pp. 5588--5591, May 2011.
[13] T. Oba, T. Hori, and A. Nakamura, "Improved sequential dependency analysis integrating labeling-based sentence boundary detection," IEICE Transactions on Information and Systems, vol. E93-D, no. 5, pp. 1272--1281, 2010.
[14] S. Watanabe, T. Iwata, T. Hori, A. Sako, and Y. Ariki, "Topic tracking language model for speech recognition," Computer Speech and Language, vol. 25, no. 2, pp. 440--461, April 2011.
Image processing and Non-verbal communication scene analysis
[15] K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato, "A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization," in Proc. ICMI 2008, pp. 257--264, 2008.
[16] Conversation Scene Analysis page by Kazuhiro Otsuka (link).