CV · AI · May 10

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

arXiv:2605.16366
71.4
AI Analysis

Fre-Res addresses the efficiency-accuracy trade-off in video MLLMs, and is aimed at practitioners who need to process long or dense video streams with limited compute.

Fre-Res introduces a dual-track video token compression framework that separates sparse spatial anchors from compact temporal residual-frequency tokens. On fine-grained short-video and long-video reasoning benchmarks, it matches or approaches full-token accuracy while substantially reducing visual-token length.

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
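
The residual-frequency idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the `(T, N, D)` latent layout, the use of the first frame as the only spatial anchor, the function names, and the fixed `keep` budget are all assumptions made for illustration, and the Spatial-Guided Absorber is not modeled. The sketch takes inter-frame differences of token embeddings, applies an orthonormal 1D DCT-II along the time axis, and keeps only the lowest-frequency coefficients.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)  # rescale the DC row so the matrix is orthonormal
    return basis


def compress_residuals(latents: np.ndarray, keep: int):
    """Split latents of shape (T, N, D) into an anchor and low-frequency residuals.

    Hypothetical layout: T frames, N tokens per frame, D embedding dims.
    Returns the first-frame anchor (N, D) and `keep` DCT coefficients (keep, N, D).
    """
    anchor = latents[0]
    residuals = np.diff(latents, axis=0)  # inter-frame residuals, (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))  # DCT along time
    return anchor, coeffs[:keep]


def reconstruct(anchor: np.ndarray, coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Approximate the (T, N, D) latents from the anchor and kept coefficients."""
    n = num_frames - 1
    padded = np.zeros((n,) + coeffs.shape[1:])
    padded[: coeffs.shape[0]] = coeffs  # dropped high frequencies are set to zero
    residuals = np.tensordot(dct_matrix(n).T, padded, axes=(1, 0))  # inverse DCT
    frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], frames], axis=0)
```

Under these assumptions, keeping all `T - 1` coefficients reproduces the latents up to floating-point error, and lowering `keep` trades reconstruction error for fewer temporal tokens. How well a small `keep` works depends on the low-frequency concentration the abstract reports, which this sketch does not verify.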

