CV · Mar 18, 2025

Improving LLM Video Understanding with 16 Frames Per Second

Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang

arXiv:2503.13956v2 · 21 citations · h-index: 17 · Has Code · ICML

Originality Highly original

AI Analysis

For researchers and practitioners working with video LLMs, this work addresses the limitation of low frame rates and offers a way to improve video understanding without scaling model size or data. It is incremental in that it builds on existing multimodal LLM frameworks.

The paper tackles the problem of visual information loss in video understanding with multimodal LLMs by introducing F-16, a model that increases the frame rate to 16 FPS and compresses visual tokens, achieving state-of-the-art performance on benchmarks like Video-MME and TemporalBench and outperforming proprietary models in complex spatiotemporal tasks.

Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of 2 frames per second (FPS) or less, leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming state-of-the-art proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. We will release the source code, model checkpoints, and data at https://github.com/bytedance/F-16.
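To make the sampling idea in the abstract concrete, here is a minimal sketch of picking frames on a 16 FPS grid and grouping them into 1-second clips. It is not the authors' implementation: the function names are hypothetical, and the mean-pooling compressor is only a placeholder for whatever learned token compression F-16 actually uses, which the abstract does not specify.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=16.0):
    """Return source-frame indices nearest to a uniform target_fps grid."""
    duration_s = num_frames / native_fps
    n_samples = int(duration_s * target_fps)
    step = native_fps / target_fps  # source frames per sampled frame
    return [min(num_frames - 1, int(round(i * step))) for i in range(n_samples)]


def group_into_clips(indices, frames_per_clip=16):
    """Split sampled indices into consecutive clips (1 second each at 16 FPS)."""
    return [indices[i:i + frames_per_clip]
            for i in range(0, len(indices), frames_per_clip)]


def compress_clip(frame_features):
    """Placeholder compressor (an assumption, not the paper's method):
    average the per-frame feature vectors of one clip into a single vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# A 3-second video recorded at 30 FPS (90 frames), sampled at 16 FPS.
indices = sample_frame_indices(num_frames=90, native_fps=30.0)
clips = group_into_clips(indices)
print(len(indices), len(clips))
```

Compared with sampling at 2 FPS, a 16 FPS grid yields eight times as many frames per second of video, which is why some per-clip compression is needed to keep the visual token count manageable.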
