Realtime Multimodal System for Conversation Scene Analysis
1. Overview of system
In May 2008, NTT Communication Science Laboratories (hereafter, CS Labs) exhibited a multimodal system for analyzing conversation scenes at its Open House 2008. The system targets face-to-face conversations and captures them with an omnidirectional camera-microphone device. From the images captured by the cameras, the system measures the face position and pose of each meeting participant; from the audio collected by the microphones, it detects voice activity and its direction of arrival. By integrating this information, the system can estimate conversational states such as "who talks to whom and when" and "who attracts attention from others", and can visualize the results; all processing is done in realtime. The system can also record the input data and estimated results for later playback. This realtime response distinguishes our system from other systems that target meeting scenes; it was made possible by a GPU-based face pose tracker (called STCTracker), which can track multiple faces simultaneously by exploiting the power of modern GPUs.
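The full integration algorithm is described in the ICMI 2008 paper cited below. As a minimal illustrative sketch of the core idea of attributing a detected voice to a tracked face, one can match the voice's direction of arrival (DOA) against each participant's face azimuth (the function name, data layout, and tolerance below are our own assumptions, not the system's actual implementation):

```python
def attribute_speech(face_azimuths, voice_doa_deg, tolerance_deg=20.0):
    """Attribute a detected voice to the participant whose tracked face
    azimuth lies closest to the voice's direction of arrival (DOA).

    face_azimuths: dict mapping participant id -> face azimuth in degrees.
    Returns the matching participant id, or None if no face is within
    tolerance_deg of the DOA.
    """
    best_id, best_diff = None, tolerance_deg
    for pid, az in face_azimuths.items():
        # Wrap-around angular difference, mapped into [0, 180] degrees.
        diff = abs((az - voice_doa_deg + 180.0) % 360.0 - 180.0)
        if diff <= best_diff:
            best_id, best_diff = pid, diff
    return best_id
```

For example, with faces tracked at 10, 100, and 250 degrees, a voice arriving from 95 degrees would be attributed to the participant at 100 degrees. Repeating this matching for every detected speech segment yields the "who talks when" record that the visualizations below build on.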
This demonstration system is part of our research project to develop a machine that can understand conversation scenes, and currently realizes only a small fraction of the functionality studied in the project over the past few years. Further extensions of the system are planned.
Conversation scenes and system output. The front monitor shows the realtime results.
These demo movies were captured with a separate camera to confirm the system's realtime performance.
2. Omnidirectional camera-microphone system
The camera-microphone device consists of two cameras and three microphones. Each camera is fitted with a fisheye lens that covers a hemispherical field of view; the two fisheye cameras face in opposite directions and thus together provide approximately spherical coverage. The three microphones are placed at the vertices of a triangle to form a microphone array.
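A microphone array localizes a sound source from the time difference of arrival (TDOA) between microphone pairs. As a hedged sketch of the standard far-field geometry (the speed of sound, function name, and clamping are our choices; the actual diarization front end in the system is more sophisticated), each pair constrains the source direction via sin(theta) = c * tau / d:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def doa_from_tdoa(tdoa_s, mic_spacing_m):
    """Estimate the direction of arrival for one microphone pair under a
    far-field assumption: sin(theta) = c * tau / d, where tau is the time
    difference of arrival (seconds) and d the spacing between the two
    microphones (meters). Returns the angle from broadside in degrees.
    """
    s = SPEED_OF_SOUND * tdoa_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp: measurement noise can push |s| > 1
    return math.degrees(math.asin(s))
```

A single pair leaves a front-back ambiguity; combining the three pairs of a triangular array, as in the device described above, resolves the DOA over the full 360 degrees.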
This omnidirectional capture device provides greater image quality than mirror-based designs. It also minimizes the number of discontinuities caused by stitching images from multiple cameras, as occurs with omni-vision cameras such as Point Grey Research's Ladybug2. However, two small dead zones remain the main drawback of the current device.
3. Flow of processing
4. Visualization Scheme 1: Panoramic View
This picture is an actual screenshot of the PC display during a meeting. The images from the two cameras are stacked vertically. Green meshes show the face-tracking results, and the red dots along the axes indicate the direction of arrival (DOA) of each voice.
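Unwarping a fisheye image into such a panoramic strip requires a lens model relating pixels to viewing directions. The actual lens calibration used by the system is not given here; assuming the common equidistant model (r = f * theta), the forward and inverse mappings can be sketched as follows (all parameter names are illustrative):

```python
import math

def fisheye_to_ray(x, y, cx, cy, f):
    """Convert a pixel in an equidistant fisheye image to viewing angles.

    (cx, cy) is the image center of the fisheye circle and f the lens
    focal length in pixels. Returns (theta, phi): theta is the angle off
    the optical axis (radians, via r = f * theta), phi the azimuth
    around the axis.
    """
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    return r / f, math.atan2(dy, dx)

def ray_to_fisheye(theta, phi, cx, cy, f):
    """Inverse mapping: viewing angles back to fisheye pixel coordinates."""
    r = f * theta
    return cx + r * math.cos(phi), cy + r * math.sin(phi)
```

To render the panorama, each output pixel's (azimuth, elevation) is converted to (theta, phi) for the camera covering that hemisphere, and `ray_to_fisheye` gives the source pixel to sample; stacking the two unwarped hemispheres yields the display shown above.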
5. Visualization Scheme 2: 3-D View
These pictures show how the results can be visualized in 3-D. (a) is a cylindrical visualization that gives viewers an overview of the meeting scene: it shows the relative position of each participant (indicated by a circle) and each participant's approximate field of view (blue translucent triangles). Each participant's voice activity is indicated by a red dot inside that person's circle.
(b) illustrates the second visualization scheme, called the piecewise-planar representation; each person's face image is mapped onto a planar surface. Arrows show each person's discrete gaze direction, and the foci of attention, i.e., people who attract the gaze of more than one other person, are marked with circles.
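Once each participant's discrete gaze target is estimated, identifying the foci of attention reduces to counting gazes per target. A minimal sketch of that counting step (the data layout is our assumption, not the system's internal representation):

```python
from collections import Counter

def focus_of_attention(gaze_targets):
    """Given each participant's discrete gaze target (the id of the person
    they are looking at, or None if looking at no one), return the set of
    people attracting the gaze of more than one other person.
    """
    counts = Counter(t for t in gaze_targets.values() if t is not None)
    return {person for person, n in counts.items() if n > 1}
```

For example, if A and B both look at C while C looks at A, only C qualifies as a focus of attention and would be circled in the visualization.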
Furthermore, the system offers a maneuverable interface based on a 3-D mouse; users can freely and intuitively manipulate their viewpoint, as shown in (c) and (d).
6. Exhibition at OpenHouse2008
On May 29 and 30, 2008, at the CS Labs Open House 2008 held in Keihanna, Kyoto, Japan, many guests visited our booth and tried out the demo system.
Left picture: the demonstration site. A presenter explains the system while visitors try it out.
Right picture: the system's screen is relayed to another site (actually a hallway near the demo site), where visitors observe the meeting as remote viewers.
7. Introduction Video
8. Future Plan
We will continue to develop our system and aim to create new meeting-related applications that will enhance communication capability. Examples include the automatic generation of multimedia minutes of meetings, automatic camera work for tele-conferencing, and social robots/agents. To realize such applications, we believe that it is important to understand communication by decoding the multimodal signals of humans such as head gestures, facial expressions, and prosody in meeting situations.
K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato, "A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization", Proc. ACM 10th Int. Conf. Multimodal Interfaces (ICMI 2008), pp. 257-264, 2008.
© ACM, (2008). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICMI’08, October 20–22, 2008, Chania, Crete, Greece. Copyright 2008 ACM 978-1-60558-198-9/08/10
- Eizo Shinbun, June 9, 2008: NTT Communication Science Labs. Open House 2008 (in Japanese)
- Nikkan Kogyo Shinbun, June 10, 2008 (in Japanese)
- K. Otsuka and S. Araki, ITU Journal, Vol. 38, No. 8, pp. 5-7, 2008 (in Japanese)