CV · Apr 10, 2025

How Can Objects Help Video-Language Understanding?

Zitian Tang, Shijie Wang, Junho Cho, Jaewook Yoo, Chen Sun

arXiv:2504.07454v2 · 6 citations · h-index: 6 · Has Code

Originality: Incremental advance

AI Analysis

This work aims to improve video-language understanding in multimodal large language models. It is an incremental advance: it builds on existing MLLM frameworks and compares methods for integrating object representations into them.

The paper tackles whether explicit object representation is needed in multimodal large language models (MLLMs) for video-language understanding, and finds that integrating object-centric representations improves performance on video question answering benchmarks, with a simple text quantization method proving most effective.

Do we still need to represent objects explicitly in multimodal large language models (MLLMs)? At one extreme, pre-trained encoders convert images into visual tokens, with which objects and spatiotemporal relationships may be implicitly modeled. At the other extreme, image captions by themselves provide strong empirical performance on understanding tasks, despite missing fine-grained spatiotemporal information. To answer this question, we introduce ObjectMLLM, a framework capable of leveraging arbitrary computer vision algorithms to extract and integrate structured visual representations. Through extensive evaluations on six video question answering benchmarks, we confirm that explicit integration of object-centric representations remains necessary. Surprisingly, we observe that the simple approach of quantizing the continuous, structured object information and representing it as plain text performs the best, offering a data-efficient approach to integrating other visual perception modules into MLLM design. Our code and models are released at https://github.com/brown-palm/ObjectMLLM.
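To make the idea of quantizing continuous object information into plain text concrete, here is a minimal sketch. It is not taken from the ObjectMLLM repository: the function names (`quantize_box`, `track_to_text`), the choice of 100 bins, the pixel-space box convention, and the output string format are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def quantize_box(box: Box, width: int, height: int, bins: int = 100) -> List[int]:
    """Map each pixel coordinate to an integer bin in [0, bins - 1]."""

    def q(value: float, extent: int) -> int:
        # Normalize to [0, 1], clamp, then bucket into `bins` equal intervals.
        norm = min(max(value / extent, 0.0), 1.0)
        return min(int(norm * bins), bins - 1)

    x0, y0, x1, y1 = box
    return [q(x0, width), q(y0, height), q(x1, width), q(y1, height)]


def track_to_text(
    label: str,
    track: List[Tuple[int, Box]],
    width: int,
    height: int,
    bins: int = 100,
) -> str:
    """Render one object's per-frame boxes as a plain-text string (hypothetical format)."""
    parts = []
    for frame_idx, box in track:
        coords = quantize_box(box, width, height, bins)
        parts.append(f"frame {frame_idx}: [{', '.join(map(str, coords))}]")
    return f"{label}: " + "; ".join(parts)


if __name__ == "__main__":
    # A single tracked object in a 640x480 video, observed in two frames.
    track = [(0, (320.0, 120.0, 480.0, 360.0)), (8, (0.0, 0.0, 640.0, 480.0))]
    print(track_to_text("person", track, width=640, height=480))
```

A string produced this way could be placed in a language model prompt alongside the question, so no new input modality or embedding layer is needed. Which bin count and text format actually work best is an empirical question that this sketch does not address.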

