CV · Jun 26, 2024

Chrono: A Simple Blueprint for Representing Time in MLLMs

arXiv:2406.18113v6 · 13 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of contextual and temporal comprehension in video-language models, which is crucial for applications like video retrieval and question answering, though it appears incremental as it builds on existing MLLM architectures.

The authors tackled the problem of temporal localization in videos by introducing Chrono, a simple blueprint for representing time in multimodal large language models, achieving new state-of-the-art results on benchmarks like Charades-STA, QVHighlights, and ActivityNet Captions.

The recent success of Large Language Models (LLMs) has prompted the extension to the multimodal domain, developing image-text Multimodal LLMs (MLLMs) and then video-text models. In this work, we investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. To address this problem, prior works have developed complex task-specific architectures, novel modules to embed time into MLLMs, or leveraged additional input signals such as video transcripts to best encode contextual and temporal information. We find that most of these efforts are surpassed by a much simpler design. We introduce Chrono, a universal sequence blueprint that can be applied to any image-text pretrained MLLM. In extensive experiments spanning different MLLM architectures and sizes, finetuning and zero-shot settings, we demonstrate new state-of-the-art results in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions, as well as in grounded video question answering on NExT-GQA.

Code Implementations: 1 repo
