Video Primal Sketch: A Unified Middle-Level Representation for Video
This work provides a foundational representation for video analysis that may benefit computer vision researchers and applications such as action recognition. It is incremental, however, in that it builds on existing models such as sparse coding and FRAME.
The paper addresses video representation with a unified middle-level model called Video Primal Sketch (VPS). VPS integrates sparse coding, which represents motion patterns explicitly, with FRAME/MRF models, which represent them implicitly, and the paper demonstrates its effectiveness through synthesis, reconstruction, and human perception experiments.
This paper presents a middle-level video representation named Video Primal Sketch (VPS), which integrates two regimes of models: i) a sparse coding model that uses static or moving primitives to explicitly represent moving corners, lines, feature points, etc.; and ii) a FRAME/MRF model that reproduces feature statistics extracted from the input video to implicitly represent textured motion, such as water and fire. The feature statistics include histograms of spatio-temporal filter responses and velocity distributions. This paper makes three contributions to the literature: i) learning a dictionary of video primitives using parametric generative models; ii) proposing the Spatio-Temporal FRAME (ST-FRAME) and Motion-Appearance FRAME (MA-FRAME) models for modeling and synthesizing textured motion; and iii) developing a parsimonious hybrid model for generic video representation. Given an input video, VPS automatically selects the proper model for each motion pattern and is compatible with high-level action representations. In the experiments, we synthesize a number of textured motions; reconstruct real videos using VPS; report a series of human perception experiments that verify the quality of the reconstructed videos; demonstrate how VPS changes over scale transitions in videos; and present the close connection between VPS and high-level action models.
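To make the notion of "feature statistics" concrete, the following is a minimal sketch of one such statistic: a normalized histogram of the responses of a single spatio-temporal filter over a video volume. It is not the authors' implementation. The function name `filter_response_histogram`, the bin count, the value range, and the simple two-frame temporal difference filter are all assumptions chosen for illustration; the paper's own filter bank is not specified in this summary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def filter_response_histogram(video, kernel, bins=15, value_range=(-1.0, 1.0)):
    """Normalized histogram of one spatio-temporal filter's responses.

    video  : array of shape (T, H, W), intensities assumed to lie in [0, 1]
    kernel : array of shape (kt, kh, kw), the (hypothetical) filter
    """
    # Every (kt, kh, kw) window of the volume, "valid" positions only.
    windows = sliding_window_view(video, kernel.shape)
    # Correlate each window with the kernel to get a response volume.
    responses = np.einsum("thwijk,ijk->thw", windows, kernel)
    # Clip so that every response falls inside the histogram range.
    responses = np.clip(responses, value_range[0], value_range[1])
    hist, _ = np.histogram(responses, bins=bins, range=value_range)
    return hist / hist.sum()


# Illustrative filter (an assumption): difference between consecutive frames.
temporal_diff = np.array([[[-1.0]], [[1.0]]])  # shape (2, 1, 1)

rng = np.random.default_rng(0)
noisy_video = rng.random((8, 16, 16))    # stand-in for a textured motion clip
static_video = np.full((8, 16, 16), 0.5)  # nothing moves

h_noisy = filter_response_histogram(noisy_video, temporal_diff)
h_static = filter_response_histogram(static_video, temporal_diff)
```

For the static clip all responses are zero, so the histogram mass concentrates in the central bin, while the noisy clip spreads mass across bins. A FRAME-style model would treat a set of such histograms, one per filter, as the statistics that a synthesized video should reproduce.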