Real-time Meeting Recognition and Understanding using Distant Microphone Array and Omni-directional Camera

last modified: Dec. 27, 2011.

Fig. 1: Meeting room.
This page demonstrates a prototype low-latency online meeting recognizer [1,2] for group meetings, which has been developed by NTT Communication Science Laboratories, Signal Processing Research Group.
Our system automatically recognizes "who speaks what to whom and when" in an online manner, by using the audio and visual information captured by a microphone array and an omnidirectional camera at the center of a table (Fig. 1).


0: Introduction
1: Prototype of real-time meeting browser
2: Meeting assistance system with our meeting recognizer
3: System architecture of the meeting recognizer
4: References

1: Prototype of real-time meeting browser

Fig. 2: Prototype of the meeting browser.
Our system analyzes the meeting by integrating audio and visual information.

All the analysis results are continuously displayed on a browser (Fig. 2).

2: Meeting assistance system with our meeting recognizer

Fig. 3: Meeting assistance with a smartphone.
We also implemented a meeting assistance system as an application of our real-time meeting recognizer.
The system visualizes the meeting recognition results on tablet PCs and smartphones (Fig. 3) in an online manner. It also lets users easily retrieve and access the recorded meeting information.
  • Demo video of the meeting assistance system on a tablet PC (100 MB). (The tablet PC version appears after 0:48. The audio is in Japanese, but we hope it conveys the idea of the system.)
3: System architecture of the meeting recognizer

Fig. 4: System architecture.
Figure 4 shows the system architecture.
Part (A): Speech enhancement: In meeting recognition scenarios, we need to remove several types of audio distortion, e.g., additive noise, speech overlaps, and reverberation. To combat all these types of distortion, we employ four techniques [4]:
1. Dereverberation based on multi-channel weighted linear prediction [5]
2. Speaker diarization (= estimation of "who speaks when") based on direction-of-arrival estimates [6]
3. Source separation with a beamforming approach [6]
4. Noise suppression with a mel-scaled Wiener filter [7]
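To give a feel for steps 2 and 3, here is a minimal sketch of direction-of-arrival estimation via GCC-PHAT and delay-and-sum beamforming for two microphones. This is an illustrative toy, not the system's actual implementation; the signal lengths, sampling rate, and the circular-shift toy data are assumptions for the example only.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time difference of arrival (TDOA, in seconds) of
    `sig` relative to `ref` using the GCC-PHAT cross-correlation."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def delay_and_sum(signals, advances, fs):
    """Advance each channel by the given time (seconds) in the frequency
    domain and average, steering the array toward the source."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for ch, tau in zip(signals, advances):
        out += np.fft.irfft(np.fft.rfft(ch) * np.exp(2j * np.pi * freqs * tau), n=n)
    return out / len(signals)

# Toy two-microphone example: mic 1 receives the source 5 samples late.
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(1600)
mic0 = src
mic1 = np.roll(src, 5)                       # circular 5-sample delay
tdoa = gcc_phat(mic1, mic0, fs)              # recovers 5 / fs seconds
enhanced = delay_and_sum(np.stack([mic0, mic1]), [0.0, tdoa], fs)
```

A real system sweeps candidate directions over many microphones and updates the estimates frame by frame, but the core idea, whitened cross-correlation for the delay, phase-aligned summation for the enhancement, is the same.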

Part (B): Speech recognition provides transcribed utterances from each speaker's enhanced signal. We employ the speech recognizer SOLON (Speech recognizer with OutLook On the Next generation) [8], developed at our laboratories, which achieves accurate and fast speech recognition using discriminatively trained acoustic and language models [9][10].
SOLON adopts an efficient WFST-based decoding algorithm [11] with a fast on-the-fly composition technique that allows different language models to be integrated in low-latency one-pass decoding [12].
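The idea behind on-the-fly composition can be sketched in a few lines. The toy below composes two tiny transducers lazily, expanding a product state only when the search reaches it; it omits epsilon transitions, final weights, and pruning, and is an illustration of the general technique rather than SOLON's algorithm.

```python
from collections import deque

# Each WFST is a dict: state -> list of (input, output, next_state, weight)
# arcs. Epsilon transitions and final weights are omitted for brevity.
def lazy_compose(A, B, start_a, start_b):
    """On-the-fly composition of transducers A and B: a composed state
    (p, q) is expanded only when the search first reaches it, so the
    full product machine is never built in advance."""
    arcs, seen = {}, {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        out = []
        for ain, aout, pn, wa in A.get(p, []):
            for bin_, bout, qn, wb in B.get(q, []):
                if aout == bin_:             # match A's output to B's input
                    out.append((ain, bout, (pn, qn), wa + wb))
                    if (pn, qn) not in seen:
                        seen.add((pn, qn))
                        queue.append((pn, qn))
        arcs[(p, q)] = out
    return arcs

# Toy example: A maps letters to symbols, B rescores those symbols.
A = {0: [("a", "x", 1, 0.5)], 1: [("b", "y", 2, 0.25)]}
B = {0: [("x", "X", 1, 0.125)], 1: [("y", "Y", 2, 0.25)]}
composed = lazy_compose(A, B, 0, 0)
# composed[(0, 0)] -> [("a", "X", (1, 1), 0.625)]
```

In decoding, A would be the recognition network and B a language model; laziness is what keeps one-pass decoding low-latency, since only the states actually visited by the search are ever composed.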

SOLON is used both for browsing ongoing meetings and for retrieving recorded ones.

For browsing ongoing meetings, we apply our proposed techniques to obtain the analysis results from the speech recognition output with low latency.
Detailed information can be found in [1,2] and their references.

Part (C): Visual processing: The face pose tracker [15,16] provides head pose information for each participant, from which we estimate "who is looking at whom" and the visual focus of attention.
Our system also estimates "who is speaking to whom" by combining the audio and visual information.
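One simple way to combine the two modalities is to let the diarization output say who is speaking and the gaze estimates say whom they face, then vote over a short window. The sketch below is a hypothetical illustration under that assumption; the function name, data layout, and voting rule are ours, not the system's actual model.

```python
from collections import Counter

def estimate_addressee(frames):
    """frames: list of (speaker, gaze) observations over a short window,
    where gaze maps each participant to the person they face (or None).
    In each frame the speaker's gaze target gets one vote; the majority
    target over the window is returned as the estimated addressee."""
    votes = Counter()
    for speaker, gaze in frames:
        target = gaze.get(speaker)
        if target is not None and target != speaker:
            votes[target] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy window: participant A speaks while mostly facing B.
frames = [
    ("A", {"A": "B", "B": "A", "C": "A"}),
    ("A", {"A": "B", "B": "A", "C": "B"}),
    ("A", {"A": "C", "B": "A", "C": "A"}),
]
addressee = estimate_addressee(frames)   # "B" (2 votes) beats "C" (1 vote)
```

A probabilistic model over head poses and speech activity would be more robust than majority voting, but the sketch shows the essential fusion: neither modality alone can answer "who is speaking to whom".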

4: References
