Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
This addresses the efficiency-accuracy trade-off in video MLLMs for practitioners needing to process long or dense video streams with limited compute.
Fre-Res introduces a dual-track video token compression framework that separates spatial anchors from temporal residual-frequency tokens, achieving comparable or better accuracy than full-token methods while significantly reducing token length across fine-grained short- and long-video reasoning benchmarks.
Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose \textbf{Fre-Res}, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy--efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.