Face Pose Tracker using GPU  -STCTracker-


Keywords: GPGPU, GPU Computing, NVIDIA CUDA, Face Tracking, Particle Filter


In July 2008, NTT CS Labs published a paper through Springer. This web page provides supplementary information about the face tracker proposed in that paper.

Oscar Mateo Lozano and Kazuhiro Otsuka

Real-time visual tracker by Stream processing  ---Simultaneous and fast 3D tracking of multiple faces in video sequences by using a particle filter ---

Journal of VLSI Signal Processing Systems     

Freely downloadable from http://www.springerlink.com/content/pk22n1632859082k/


Abstract of Paper

In this work, we implement a real-time visual tracker that targets the position and 3D pose of objects in video sequences, specifically faces. The use of stream processors to perform the computations and efficient Sparse-Template-based particle filtering allows us to achieve real-time processing even when tracking multiple objects simultaneously in high-resolution video frames. Stream processing is a relatively new computing paradigm that permits the expression and execution of data-parallel algorithms with great efficiency and minimum effort. By using a GPU (graphics processing unit, a consumer-grade stream processor) and the NVIDIA CUDA™ technology, we have achieved significant performance improvements, up to ten times the performance of a similar CPU-only tracker. At the same time, the stream processing approach opens the door to other computing devices, like the Cell/BE™ or other multicore CPUs.



Tracking results

Meshes show the position and pose of the face being tracked. They also show the face shapes, which were built during initialization.

Speed of the full application, comparing the stream-processing performance of the GPU version with that of a serial CPU-only version. Video: 1,024 × 768; 230 feature points per face; 1,000 particles per face.




Currently, NTT CS Labs is working on Conversation Scene Analysis, which aims to realize automatic analysis and understanding of face-to-face human conversation. STCTracker, our face pose tracker, was developed to measure the position and pose of the faces of meeting participants. Until now, other methods such as magnetic sensors have been used for analyzing meetings, but image-based face tracking is acknowledged as the best solution because it places fewer constraints on users than sensors that must be attached to them.


STCTracker (Sparse Template Condensation Tracker), proposed in this paper, is an extension of a method developed in the Shakunaga Laboratory at Okayama University. The main feature of this method is its combination of sparse template matching and particle filtering. Sparse template matching uses a set of feature points within the template region, unlike traditional template matching, which uses all pixels in the region. This sparseness enhances matching speed. In addition, the combination of robust matching and particle filtering yields robust tracking without requiring precise face models or complex feature extraction of facial parts. Note that another tracker that uses a sparse set of points for image matching has been proposed at CMU.
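
The idea of sparse template matching can be sketched as follows (a minimal CPU illustration in Python, not the authors' code; the warp function and error measure here are simplifying assumptions): instead of comparing every pixel of a template region, only a small set of feature points is sampled from the warped input image and compared.

```python
import numpy as np

def sparse_match_error(image, points, intensities, warp):
    """Sum of absolute intensity differences over a sparse set of
    feature points (instead of every pixel in the template region)."""
    err = 0.0
    for (x, y), t in zip(points, intensities):
        u, v = warp(x, y)                      # map template coords into the image
        err += abs(image[int(v), int(u)] - t)  # compare one feature point
    return err

# Toy example with an identity warp: the template intensities were sampled
# from the image itself, so the matching error is zero.
img = np.arange(100, dtype=float).reshape(10, 10)
pts = [(2, 3), (5, 5), (7, 1)]
tmpl = [img[3, 2], img[5, 5], img[1, 7]]
err = sparse_match_error(img, pts, tmpl, lambda x, y: (x, y))
print(err)  # 0.0
```

With a few hundred feature points instead of tens of thousands of pixels, each matching evaluation is orders of magnitude cheaper, which is what makes evaluating many particle hypotheses per frame feasible.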



An example of sparse template


We focused on sparse template condensation tracking and newly incorporated a 3D face model, an automatic initialization process, and GPU-based acceleration of the particle filter. Among these, the key feature is the GPU-based acceleration. The particle filter is known to be a robust state-estimation method, but it is computationally expensive, which has hampered its wide use in real-time tracking applications.


Particle filtering is an algorithm for sequential estimation of a target state such as position and pose. In particle filtering, the probability density distribution of the target state is represented by a set of particles: the posterior density of the target state given the input image is calculated and represented as a particle set. In other words, each particle is a hypothesis about the target state, and each hypothesis is evaluated by assessing how well it fits the current input data. Based on the scores of the hypotheses, the set is resampled and regenerated at the next time step. Since each hypothesis (particle) is evaluated independently, particle filtering is well suited to parallel processing. However, even with current multicore CPUs, the speedup is at most a factor of two to four. We therefore turned to GPUs (graphics processing units), which can offer roughly ten times the floating-point performance of modern CPUs and are advancing more rapidly than CPUs.

In recent years, the GPGPU (general-purpose GPU) movement has been gaining attention. In particular, NVIDIA's CUDA has been on the rise because it lowers the barriers to GPU computing: the GPU becomes programmable in the C language. In February 2007, we adopted CUDA 0.8 and developed the GPU-based STCTracker. We believe this work is among the earliest examples of CUDA-based particle filtering. Several GPU implementations of the particle filter have been proposed, but they are based on the traditional GPGPU framework.
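
The evaluate/resample/diffuse cycle described above can be sketched for a toy 1-D tracking problem (a minimal Python illustration under assumed Gaussian weighting and a random-walk motion model, not the tracker's actual likelihood):

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, observation, noise=0.5):
    """One update/resample cycle: weight each hypothesis by how well it
    explains the observation, then resample in proportion to the weights."""
    # Evaluate every hypothesis independently (this is the parallel part).
    weights = np.exp(-0.5 * ((particles - observation) / noise) ** 2)
    weights /= weights.sum()
    # Resample: good hypotheses survive, poor ones are discarded.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    # Diffuse (random-walk motion model) to regenerate diversity.
    return particles[idx] + rng.normal(0.0, 0.1, size=len(particles))

# Track a stationary 1-D target at position 3.0, starting from a broad prior.
particles = rng.uniform(-10, 10, size=1000)
for _ in range(20):
    particles = particle_filter_step(particles, observation=3.0)
print(particles.mean())  # close to 3.0
```

Each call to the weighting step evaluates all 1,000 hypotheses independently, which is exactly the structure that maps onto parallel hardware.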


Key point of acceleration using GPU


STCTracker exploits two independent levels of computation: one is particle-wise independence, and the other is the independent matching of each feature point in a template. We mapped this dual independence onto two levels of parallel computation on the GPU by exploiting the hierarchical structure of NVIDIA's G80 GPU. More precisely, the G80 (GeForce 8800 GTX) consists of 128 stream processors; groups of eight are packed together to form a multiprocessor, as shown in the figure below.

Hardware model of G80 GPU by CUDA

Cited from NVIDIA CUDA Compute Unified Device Architecture  Programming Guide Version 1.1, p. 14


The GPU consists of 16 multiprocessors. The eight stream processors in each multiprocessor can access the same shared memory; direct communication between different multiprocessors is not possible. STCTracker executes the image matching between one particle's template and the input image within a single multiprocessor. This process calculates the difference in pixel intensity between the template and the input image. Since the matching of each point in a template is independent, the computation can be distributed over the stream processors within the multiprocessor. The matching results are stored in shared memory and finally summed to give the particle's total matching error (which eventually yields the particle's weight). The per-particle computations, in turn, are distributed over the multiprocessors. This dual parallelism allows STCTracker to use the GPU's power effectively. Note that the input image is transferred from host memory to the GPU's texture memory; this transfer is currently a bottleneck.
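
The dual parallelism can be mimicked on a CPU with array axes (a sketch in NumPy, not the CUDA kernel): one axis corresponds to particles, which the GPU distributes over multiprocessors, and the other to feature points, which are distributed over the stream processors of one multiprocessor and reduced in shared memory. The error-to-weight conversion below is an assumed form for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
num_particles, num_points = 1000, 230   # figures quoted on this page

# Hypothetical per-(particle, point) intensity differences. On the G80,
# the point axis maps to the stream processors of one multiprocessor,
# and the particle axis maps to the multiprocessors.
diffs = rng.normal(size=(num_particles, num_points))

# Per-multiprocessor reduction in shared memory: sum over the point axis
# to get one total matching error per particle.
errors = np.abs(diffs).sum(axis=1)

# Matching errors become particle weights (lower error -> higher weight).
weights = np.exp(-errors / errors.mean())
weights /= weights.sum()
print(errors.shape, weights.sum())  # (1000,) 1.0
```

The key point is that the inner reduction (over points) never needs to leave one multiprocessor's shared memory, while the outer loop (over particles) needs no communication at all.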


Flow of STCTracker


The figure above shows the flow of STCTracker; it consists of an initialization part (left) and a particle filter part (right). The initialization part detects frontal faces in the image and builds a face shape model fitted to each person's face. Feature points are also detected in the facial region. From the feature points and the face shape model, a sparse template (face model) is formed. Specifically, the template consists of a set of feature points, each with coordinates (x, y, z) and an intensity value; the z value is obtained from the shape model, which is produced by morphing a rough model to fit the target face.
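
Template construction might look like the following (a Python sketch under assumptions; the actual shape-model fitting and feature detection are far more involved, and the flat depth function here is a stand-in for the morphed shape model):

```python
import numpy as np

def build_sparse_template(image, feature_points, shape_depth):
    """Hypothetical template construction: each feature point stores its
    (x, y) image coordinates, a z value taken from the fitted shape model,
    and the pixel intensity sampled at that point."""
    template = []
    for (x, y) in feature_points:
        template.append((x, y, shape_depth(x, y), image[y, x]))
    return np.array(template)           # shape: (num_points, 4)

img = np.arange(64, dtype=float).reshape(8, 8)
pts = [(1, 2), (4, 4), (6, 3)]
tmpl = build_sparse_template(img, pts, lambda x, y: 0.0)  # flat "shape model"
print(tmpl.shape)  # (3, 4)
```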


Next, the particle filter part alternately executes the update and diffusion stages. The state of a particle is a 7-dimensional vector consisting of position, pose, scale, and an illumination coefficient. The update stage calculates each particle's weight by matching the particle's template against the input image. The diffusion stage resamples particles according to their weights and diffuses the particle set according to a motion model of the target, e.g., a random-walk model. The display part computes point statistics of the particle distribution to estimate the target state.
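
The diffusion and display stages can be sketched as follows (a Python illustration; the page only states that the state is 7-dimensional, so the exact component layout and the mean as the point statistic are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 7-D state layout: x, y, roll, pitch, yaw, scale, illumination.
STATE_DIM = 7
num_particles = 1000
particles = rng.normal(size=(num_particles, STATE_DIM))
weights = rng.random(num_particles)     # stand-in for matching-based weights
weights /= weights.sum()

# Diffusion stage: resample by weight, then perturb with a random walk.
idx = rng.choice(num_particles, size=num_particles, p=weights)
particles = particles[idx] + rng.normal(0.0, 0.01, size=(num_particles, STATE_DIM))

# Display part: a point estimate of the target state, here the particle mean
# (after resampling, the weights are uniform again).
estimate = particles.mean(axis=0)
print(estimate.shape)  # (7,)
```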




We have applied STCTracker to conversation scene analysis. Our latest system demonstrates the impact of real-time tracking of people's faces in meetings; this page gives an overview of the system and offers many sample movies. The system uses two cameras equipped with fisheye lenses and transforms the fisheye images into panoramic images for face tracking on GPUs.


As another example, STCTracker can be extended to facial expression recognition. To that end, in collaboration with the University of Tokyo, we developed a new facial expression model, called the variable-intensity template, which is a novel extension of the sparse template. With this model, facial expressions can be recognized even when the face is not directed toward the camera, i.e., it is robust against head-pose changes. See Shiro Kumano's home page for details.





All rights reserved, Copyright(C) 2005, 2006, 2007, 2008 NTT Communication Science Laboratories

Any reproduction, modification, distribution, or republication of materials contained on this Web site, without prior explicit permission of the copyright holder, is strictly prohibited.