Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos
This work addresses semantic gaps in video-language alignment for applications such as video understanding. It appears incremental: it builds on existing methods and adds a novel refinement approach.
The paper tackles the challenge of aligning vision and language in videos by introducing Planner-Refiner, a framework that iteratively refines visual representations based on language guidance, achieving superior performance on tasks like Referring Video Object Segmentation and Temporal Grounding, especially with complex prompts.
Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.
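The control flow described above (a plan of short noun-phrase/verb-phrase sentences, each driving one space-then-time refinement step over a persistent set of visual tokens) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The names `toy_embed`, `guided_attention`, `refine_step`, and `planner_refiner` are hypothetical, the plan is written by hand rather than produced by a Planner module, and the choices to inject language guidance as an additive score bias and to use the noun phrase for the spatial pass and the verb phrase for the temporal pass are assumptions made only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def toy_embed(phrase, dim=4):
    """Hypothetical deterministic phrase embedding; stands in for a real text encoder."""
    vec = [0.0] * dim
    for i, ch in enumerate(phrase):
        vec[i % dim] += (ord(ch) % 7) / 10.0
    return vec

def guided_attention(tokens, guide):
    """Self-attention over a list of D-dim tokens.

    Assumption: language guidance enters as an additive bias that raises
    the score of keys similar to the `guide` vector.
    """
    d = len(guide)
    out = []
    for q in tokens:
        scores = [(dot(q, k) + dot(guide, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * k[j] for wi, k in zip(w, tokens)) for j in range(d)])
    return out

def refine_step(video, noun_vec, verb_vec):
    """One refinement step over video[t][n] = token n of frame t."""
    # Space: attend among tokens of the same frame (assumed noun-guided).
    spatial = [guided_attention(frame, noun_vec) for frame in video]
    n_frames, n_tokens = len(spatial), len(spatial[0])
    # Time: attend across frames at each token position (assumed verb-guided).
    tracks = [
        guided_attention([spatial[t][n] for t in range(n_frames)], verb_vec)
        for n in range(n_tokens)
    ]
    return [[tracks[n][t] for n in range(n_tokens)] for t in range(n_frames)]

def planner_refiner(video, plan, embed):
    """Chain refinement steps, carrying the refined tokens from step to step."""
    for noun_phrase, verb_phrase in plan:
        video = refine_step(video, embed(noun_phrase), embed(verb_phrase))
    return video

# Hand-written plan standing in for the Planner's decomposition of a long prompt.
plan = [("the brown dog", "runs to the left"), ("the dog", "jumps over the fence")]
# Toy video: 2 frames, 3 tokens per frame, 4 dimensions per token.
video = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5], [0.0, 0.0, 0.0, 1.0]],
]
refined = planner_refiner(video, plan, toy_embed)
print(len(refined), len(refined[0]), len(refined[0][0]))
```

The refined tokens keep the input layout (frames, tokens per frame, feature dimension), so in a full system they could be passed unchanged to a task-specific head for segmentation or temporal grounding. Splitting attention into a spatial pass followed by a temporal pass keeps each attention call small, which is one plausible reading of the "efficient single-step refinement" described in the abstract.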