CV · Dec 30, 2015

Actor-Action Semantic Segmentation with Grouping Process Models

arXiv:1512.09041v1 · 43 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of determining who is performing which action, and where, in a video, which is central to advanced video analysis. It represents a significant incremental improvement over existing local models.

The paper tackles actor-action semantic segmentation in videos by proposing a dynamic model that combines local CRF labeling with hierarchical supervoxel grouping, achieving a 60% relative improvement over state-of-the-art methods on a large-scale dataset.

Actor-action semantic segmentation made an important step toward advanced video understanding problems: what action is happening, who is performing the action, and where the action is in space-time. Current models for this problem are local, based on layered CRFs, and are unable to capture long-range interactions between video parts. We propose a new model that combines these local labeling CRFs with a hierarchical supervoxel decomposition. The supervoxels provide cues for possible groupings of nodes, at various scales, in the CRFs to encourage adaptive, high-order groups for more effective labeling. Our model is dynamic and continuously exchanges information during inference: the local CRFs influence which supervoxels in the hierarchy are active, and these active nodes influence the connectivity in the CRF; we hence call it a grouping process model. The experimental results on a recent large-scale video dataset show a large margin of 60% relative improvement over the state of the art, which demonstrates the effectiveness of the dynamic, bidirectional flow between labeling and grouping.
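The bidirectional exchange described in the abstract can be illustrated with a toy alternation. The sketch below is not the paper's formulation or inference procedure: the function names, the agreement threshold `agree`, the penalty `weight`, and the greedy per-node update are all assumptions chosen for illustration, and plain index lists stand in for a supervoxel hierarchy.

```python
from collections import Counter


def majority_label(labels, segment):
    """Return (label, fraction) for the most common label among the segment's nodes."""
    counts = Counter(labels[i] for i in segment)
    label, count = counts.most_common(1)[0]
    return label, count / len(segment)


def grouping_process(unary, segments, agree=0.6, weight=1.0, max_iters=10):
    """Toy alternation between labeling and grouping (hypothetical, for illustration).

    unary[i][l] : cost of giving node i label l (lower is better)
    segments    : lists of node indices, standing in for supervoxels at several scales
    """
    num_labels = len(unary[0])
    # Start from the purely local labeling: each node takes its cheapest label.
    labels = [min(range(num_labels), key=lambda l: u[l]) for u in unary]
    active = []
    for _ in range(max_iters):
        # Labeling -> grouping: a segment becomes active when its nodes mostly agree.
        active = []
        for seg in segments:
            lab, frac = majority_label(labels, seg)
            if frac >= agree:
                active.append((seg, lab))
        # Grouping -> labeling: each active segment penalizes labels that
        # disagree with its majority label, for every node it contains.
        new_labels = []
        for i, u in enumerate(unary):
            cost = list(u)
            for seg, lab in active:
                if i in seg:
                    for l in range(num_labels):
                        if l != lab:
                            cost[l] += weight
            new_labels.append(min(range(num_labels), key=lambda l: cost[l]))
        if new_labels == labels:
            break  # labeling and grouping agree; stop iterating
        labels = new_labels
    return labels, [seg for seg, _ in active]
```

With six nodes whose local costs give the labels 0, 0, 1, 0, 1, 1 and segments `[0, 1, 2, 3]` and `[4, 5]`, the first segment becomes active with majority label 0 and pulls the weakly mislabeled third node back to 0, while a run with no segments simply returns the local labeling.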
