Compositional Structure Learning for Action Understanding
This work addresses the need for more detailed action analysis in applications such as mobile robotics and video search, and represents an incremental improvement over existing methods.
The paper tackles the problem of richer action understanding beyond classification by proposing a compositional model with a new mid-level representation and a structured parts model, achieving state-of-the-art performance on action detection, localization, and recognition.
The action understanding literature has focused predominantly on classification; however, many applications, such as mobile robotics and video search, demand richer action understanding, with solutions to classification, localization, and detection. In this paper, we propose a compositional model that leverages a new mid-level representation called compositional trajectories and a locally articulated spatiotemporal deformable parts model (LASTDPM) for full action understanding. Our method is advantageous in capturing the variable structure of dynamic human activity over a long range. First, the compositional trajectories capture long-range, frequently co-occurring groups of trajectories in space-time and represent them in discriminative hierarchies, where human motion is largely separated from camera motion; second, LASTDPM learns a structured model with multi-layer deformable parts to capture multiple levels of articulated motion. We implement our method and demonstrate state-of-the-art performance on all three problems: action detection, localization, and recognition.