Small Vision-Language Models are Smart Compressors for Long Video Understanding
This work addresses long-form video understanding for AI systems. It offers a novel compression method for handling dense visual streams efficiently, though its improvement over existing MLLM frameworks is incremental.
Adapting Multimodal Large Language Models (MLLMs) to hour-long videos is bottlenecked by context limits. The paper addresses this by proposing Tempo, an efficient query-aware framework that compresses long videos for downstream understanding. It reports state-of-the-art performance: 52.3 on LVBench (4101s) under an 8K visual budget, rising to 53.7 with 2048 frames, and it outperforms models such as GPT-4o and Gemini 1.5 Pro.
Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
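To make the budgeting idea concrete, the following is a minimal sketch of relevance-weighted token allocation under a fixed visual budget. It is not the paper's ATA procedure: the abstract does not specify how ATA routes tokens, and this sketch uses a simple global proportional split rather than a causal $O(1)$ router. The function name `allocate_tokens`, the per-segment relevance scores in [0, 1], and the segment lengths are all hypothetical. Only the 0.5 to 16 tokens/frame range is taken from the abstract.

```python
def allocate_tokens(relevance, frames, budget, min_tpf=0.5, max_tpf=16.0):
    """Split a visual-token budget across video segments by relevance.

    Illustrative assumption, not the paper's algorithm:
    - every segment first receives a small floor (min_tpf tokens per frame),
      standing in for the "temporal anchors" that keep the storyline;
    - the leftover budget is shared in proportion to relevance * frames,
      capped at max_tpf tokens per frame per segment.
    """
    if len(relevance) != len(frames):
        raise ValueError("relevance and frames must have the same length")

    floor = [min_tpf * f for f in frames]
    cap = [max_tpf * f for f in frames]
    if sum(floor) > budget:
        raise ValueError("budget is too small for the minimum anchors")

    alloc = list(floor)
    remaining = budget - sum(floor)
    # Only segments with positive relevance and at least one frame compete.
    active = [i for i in range(len(frames)) if relevance[i] > 0 and frames[i] > 0]

    while remaining > 1e-9 and active:
        weight = sum(relevance[i] * frames[i] for i in active)
        spent = 0.0
        still_open = []
        for i in active:
            share = remaining * relevance[i] * frames[i] / weight
            give = min(share, cap[i] - alloc[i])
            alloc[i] += give
            spent += give
            if cap[i] - alloc[i] > 1e-9:
                still_open.append(i)
        remaining -= spent
        if len(still_open) == len(active):
            break  # no segment hit its cap, so the leftover was fully shared
        active = still_open  # redistribute what capped segments could not take

    # Round down so the total never exceeds the budget.
    return [int(a) for a in alloc]


if __name__ == "__main__":
    # Three hypothetical 100-frame segments; the middle one matches the query.
    print(allocate_tokens([0.0, 0.9, 0.1], [100, 100, 100], budget=1000))
```

In this sketch an irrelevant segment still keeps its floor allocation, while a segment that reaches the per-frame cap hands its unused share back to the remaining segments.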