Saliency-based video segmentation with sequentially updated priors
  • Our method automatically detects and segments object-like regions in videos without any manually annotated labels.
  • We use visual saliency as the prior distribution for region segmentation in place of manually annotated labels.
  • The prior distribution for each frame is updated by combining the segmentation results of previous frames with the prior derived from visual saliency.
  • A CUDA implementation accelerates the computation of the prior distributions and feature likelihoods, achieving around 10 fps on a mobile PC with a CUDA-compatible graphics board.
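
To illustrate how a saliency-derived prior can stand in for manual labels, the following is a minimal sketch, not the authors' implementation. It assumes the per-pixel object prior and the two feature likelihoods are already available as values in [0, 1], and turns them into the unnormalized negative log posterior costs that a MAP segmentation would minimize. The function name `unary_costs` and the clamp `EPS` are hypothetical.

```python
import math

EPS = 1e-6  # hypothetical clamp so that log() never receives 0


def unary_costs(prior_obj, lik_obj, lik_bkg):
    """Return (background cost, object cost) per pixel.

    Each cost is an unnormalized negative log posterior:
    -log P(feature | label) - log P(label).
    """
    costs = []
    for p, l_obj, l_bkg in zip(prior_obj, lik_obj, lik_bkg):
        p = min(1.0 - EPS, max(EPS, p))
        cost_bkg = -math.log(max(EPS, l_bkg)) - math.log(1.0 - p)
        cost_obj = -math.log(max(EPS, l_obj)) - math.log(p)
        costs.append((cost_bkg, cost_obj))
    return costs


# Three pixels: salient, non-salient, and undecided but with object-like features.
costs = unary_costs(prior_obj=[0.9, 0.1, 0.5],
                    lik_obj=[0.5, 0.5, 0.8],
                    lik_bkg=[0.5, 0.5, 0.2])
# Per-pixel decision that ignores smoothness between neighboring pixels.
labels = [1 if c_obj < c_bkg else 0 for c_bkg, c_obj in costs]
```

In this sketch the prior alone decides the first two pixels, while the third is decided by its feature likelihood because its prior is uninformative.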

This dataset contains 10 input videos and the corresponding segmented image sequences as ground truth.

Any report or publication using this data should cite the first three publications listed under "Selected publications" below.

Detailed description:
Videos: 10 uncompressed AVI clips of natural scenes at 12 fps, each containing at least one target object, possibly alongside other objects. Clip length varies from 5 to 10 seconds.
Ground truth: 10 sets of JPEG images, one set per input video. Segmented images are provided for almost all frames, excluding the first 15 frames.

(Top left) Input video (Top right) Visual attention density
(Bottom left) Priors for segmentation (Bottom right) Segmentation result

Selected publications

Ken Fukuchi, Kouji Miyazato, Akisato Kimura, Shigeru Takagi and Junji Yamato
"Saliency-based video segmentation with graph cuts and sequentially updated priors,"
Proc. International Conference on Multimedia and Expo (ICME2009),
pp. 638-641, New York, New York, USA, June-July 2009.

Kazuma Akamine, Ken Fukuchi, Akisato Kimura, Shigeru Takagi
"Fully automatic extraction of salient regions in near real-time,"
The Computer Journal, doi:10.1093/comjnl/bxq075.


Extracting important (or meaningful) regions from videos is not only a challenging problem in computer vision research but also a crucial task in many applications, including object recognition, video classification, annotation and retrieval. It can be formulated as a binary segmentation problem, where important regions are considered "objects" and the remaining regions "backgrounds". One of the most promising ways to achieve precise segmentation is Interactive Graph Cuts, proposed by Boykov et al. This method originated in the work of Greig et al., who showed that the exact maximum a posteriori (MAP) solution of a two-label pairwise Markov random field (MRF) can be obtained by finding the minimum cut on the equivalent graph of the MRF. Boykov et al. extended this work to MRFs with multiple labels and applied it to interactive image segmentation. Interactive Graph Cuts has become a de facto standard for interactive image segmentation in recent years. More recently, several approaches for extending it to video segmentation have been proposed. For example, Kohli and Torr described an efficient algorithm for computing MAP estimates for dynamically changing MRF models and tested its performance on the video segmentation problem.
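
The link between MAP estimation and minimum cuts described above can be sketched on a toy problem. The code below is an illustrative example only, not the implementation of Boykov et al. or of this work: it builds the graph for a two-label pairwise MRF with non-negative costs and solves it with a plain Edmonds-Karp max-flow. The function name `min_cut_labels`, the five-pixel chain, and all cost values are assumptions made for the example.

```python
from collections import deque


def min_cut_labels(unary, edges):
    """MAP labeling of a two-label pairwise MRF via max-flow/min-cut.

    unary[i] = (cost of label 0, cost of label 1) for pixel i, both >= 0
    edges    = [(i, j, w)]: penalty w >= 0 paid when pixels i and j differ
    """
    n = len(unary)
    s, t = n, n + 1  # source = "object" terminal, sink = "background" terminal
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, (cost_bkg, cost_obj) in enumerate(unary):
        cap[s][i] = cost_bkg  # cut when pixel i is labeled background (0)
        cap[i][t] = cost_obj  # cut when pixel i is labeled object (1)
    for i, j, w in edges:
        cap[i][j] += w
        cap[j][i] += w

    def bfs():
        # Breadth-first search over edges with remaining capacity.
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    energy = 0.0
    while True:
        parent = bfs()
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        energy += bottleneck

    # Pixels still reachable from the source lie on the "object" side of the cut.
    reachable = bfs()
    labels = [1 if reachable[i] != -1 else 0 for i in range(n)]
    return labels, energy


# Five pixels in a chain; pixel 2 weakly prefers "background" on its own.
unary = [(9, 1), (8, 2), (4, 5), (9, 1), (1, 9)]
edges = [(0, 1, 3), (1, 2, 3), (2, 3, 3), (3, 4, 3)]
labels, energy = min_cut_labels(unary, edges)
# labels == [1, 1, 1, 1, 0], energy == 13.0
```

In this toy chain the smoothness term overrides the weak unary preference of pixel 2, which is the behavior that makes graph cuts attractive for segmentation. A practical system would use a dedicated max-flow solver rather than this dense-matrix version.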

Although the above approaches are promising, they share a critical limitation: segmentation cues (seeds) must be provided manually and carefully. Such manual labeling is often infeasible, especially when these methods are extended to other applications. Fully automatic segmentation methods are therefore highly desirable. The use of saliency-based human visual attention models is one of the most promising approaches in this respect. The first biologically plausible model for explaining the human attention system was proposed by Koch and Ullman, and later implemented by Itti et al. This model analyzes still images to produce primary visual features, such as intensity, color and orientation, which are combined to form a saliency map that represents the relevance of visual attention. In addition, Pang et al. proposed the first stochastic model for estimating human visual attention, which addressed a fundamental problem of earlier attention models concerning the non-deterministic properties of the human visual system. Such models would be helpful for automatically providing segmentation seeds.
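
The idea of combining primary feature maps into a single saliency map can be sketched as follows. This is a heavily simplified, single-scale stand-in and not the model of Itti et al.: it replaces the multi-scale center-surround operators with an absolute difference from the local mean, and combines features by a plain average. All function names and the toy input are assumptions.

```python
def center_surround(feature, radius=1):
    """|value - local mean| per pixel: a crude center-surround contrast."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            neigh = [feature[j][i] for j in ys for i in xs]
            out[y][x] = abs(feature[y][x] - sum(neigh) / len(neigh))
    return out


def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    lo = min(min(row) for row in m)
    hi = max(max(row) for row in m)
    if hi == lo:
        return [[0.0] * len(row) for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def saliency_map(features):
    """Average the normalized contrast maps of all feature channels."""
    maps = [normalize(center_surround(f)) for f in features]
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(w)]
            for y in range(h)]


# Toy input: one bright pixel in an intensity channel, plus a flat second channel.
intensity = [[0.0] * 5 for _ in range(5)]
intensity[2][2] = 1.0
flat = [[0.3] * 5 for _ in range(5)]
sal = saliency_map([intensity, flat])
```

The bright pixel receives the highest saliency, while the flat channel contributes nothing. High-saliency pixels of such a map are the kind of signal that could replace manually placed seeds.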

In line with the above viewpoint, we propose a novel approach for achieving video segmentation based on visual saliency. Our main contributions are as follows:

  1. We introduce MAP-based framewise segmentation with graph cuts, where the priors for segmentation are derived from visual saliency. This approach is closely related to the work of Fu et al. on still image segmentation. However, when this idea is applied to video signals, segmentation results are sometimes unstable, since the areas in which visual saliency takes a large value may change over time and with circumstances.
  2. Therefore, we also develop a new technique for estimating and updating priors and feature likelihoods. We integrate the prior derived from the segmentation results for the previous frames and the prior derived from the saliency at the current frame by using a Kalman filter, where the prior from the saliency at the current frame is treated as the observation. The feature likelihood can also be estimated by combining two feature likelihoods, one obtained from the segmentation results for the previous frames and the other from the saliency calculation for the current frame.
  3. In addition, the whole segmentation procedure runs in near real time (about 5 fps at 352x288 pixels) using stream processing on hardware such as GPUs.
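
The prior update in the second contribution can be sketched as a per-pixel scalar Kalman filter. This is a minimal sketch under stated assumptions, not the published formulation: the state is the object prior of a pixel, the prediction simply carries over the prior from the previous frame, and the saliency-derived prior of the current frame is the observation. The function name `kalman_update_prior` and the noise variances `q` and `r` are hypothetical.

```python
def kalman_update_prior(prev_prior, prev_var, saliency_obs, q=0.01, r=0.04):
    """One scalar Kalman step per pixel.

    prev_prior   : object prior carried over from previous frames (0..1)
    prev_var     : variance (uncertainty) of that prior
    saliency_obs : prior derived from saliency at the current frame (0..1)
    q, r         : assumed process and observation noise variances
    """
    new_prior, new_var = [], []
    for p, v, z in zip(prev_prior, prev_var, saliency_obs):
        pred_p, pred_v = p, v + q           # predict: assume the prior persists
        gain = pred_v / (pred_v + r)        # Kalman gain
        post = pred_p + gain * (z - pred_p)  # correct with the saliency observation
        new_prior.append(min(1.0, max(0.0, post)))
        new_var.append((1.0 - gain) * pred_v)
    return new_prior, new_var


# Two pixels observed over three frames: one persistently salient, one not.
prior, var = [0.5, 0.5], [0.25, 0.25]
for obs in ([0.9, 0.1], [0.8, 0.2], [0.9, 0.1]):
    prior, var = kalman_update_prior(prior, var, obs)
```

Because every pixel is updated independently of the others, a step of this form lends itself to stream processing with one thread per pixel, which is consistent with the GPU execution mentioned in the third contribution.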