CV · LG · Nov 25, 2024

VideoOrion: Tokenizing Object Dynamics in Videos

arXiv:2411.16156v2 · 13 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses information loss and entangled semantics in video tokenization, a problem relevant to researchers and practitioners in video understanding. It represents an incremental improvement over prior methods.

The paper tackles the challenge of efficiently compressing high-dimensional video data into semantic tokens for Video Large Language Models. It introduces VideoOrion, which explicitly captures object dynamics through a detect-segment-track pipeline and achieves competitive results on video question answering and video-based referring benchmarks.

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos - the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
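The abstract describes object tokens as aggregates of spatial-temporal object features but does not spell out the aggregation. As a rough illustration only, the sketch below uses masked average pooling as a stand-in: for each tracked object, it averages the feature vectors under that object's masks across all frames. The function name `object_tokens`, the array layouts, and the pooling rule are assumptions made for this example, not the authors' implementation; the masks are assumed to come from some detect-segment-track pipeline that is not shown here.

```python
import numpy as np


def object_tokens(features, masks):
    """Pool per-frame features into one token per tracked object.

    Hypothetical helper for illustration; not the paper's encoder.

    features: (T, H, W, D) array of per-frame feature maps.
    masks:    (N, T, H, W) boolean array; masks[n, t] marks the pixels of
              tracked object n in frame t (all False if it is absent).
    Returns an (N, D) array with one token per object.
    """
    features = np.asarray(features, dtype=np.float64)
    weights = np.asarray(masks, dtype=np.float64)
    # Sum the features under each object's masks over space and time: (N, D).
    summed = np.einsum("nthw,thwd->nd", weights, features)
    # Count masked pixels per object across the whole clip: (N, 1).
    counts = weights.sum(axis=(1, 2, 3))[:, None]
    # Objects that never appear get a zero token instead of a division by zero.
    return summed / np.maximum(counts, 1.0)


# Toy clip: 2 frames of 2x2 feature maps with 3 channels, 2 tracked objects.
T, H, W, D = 2, 2, 2, 3
features = np.arange(T * H * W * D, dtype=float).reshape(T, H, W, D)
masks = np.zeros((2, T, H, W), dtype=bool)
masks[0, :, 0, 0] = True   # object 0: top-left pixel in both frames
masks[1, :, 1, :] = True   # object 1: bottom row in both frames
tokens = object_tokens(features, masks)
print(tokens.shape)  # (2, 3)
```

Under this assumption the number of tokens scales with the number of tracked objects rather than with frame count or resolution, which is one way to read the abstract's claim of compact, disentangled representations.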

