CV | Nov 20, 2024

On the Consistency of Video Large Language Models in Temporal Comprehension

arXiv:2411.12951v2 | 13 citations | h-index: 17 | Has Code | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the robustness and trustworthiness of Video-LLMs for temporal grounding, which is crucial for applications such as video retrieval and analysis. It is incremental, however, because it evaluates and improves existing methods rather than introducing a new paradigm.

The study investigates the consistency of Video Large Language Models (Video-LLMs) in temporal comprehension. It finds that current models are highly sensitive to variations in video content, language queries, and task settings, and that they show severe deficiencies in maintaining consistency. The proposed event temporal verification tuning significantly improves both grounding and consistency.

Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model's responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. Our data and code are open-sourced at https://github.com/minjoong507/Consistency-of-Video-LLM.
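The consistency check described in the abstract (compare an initial grounded moment against the moment predicted under a probe) can be illustrated with a small sketch. This is not the authors' evaluation code: the temporal IoU measure, the 0.5 threshold, and the names `temporal_iou` and `consistency_rate` are assumptions chosen for illustration, and the actual protocol is the one in the linked repository.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def consistency_rate(
    initial: List[Moment],
    probed: List[Moment],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> float:
    """Fraction of samples whose probed prediction overlaps the initial one.

    initial[i] is the moment first predicted for sample i; probed[i] is the
    moment predicted after a probe (for example, a rephrased query).
    """
    if len(initial) != len(probed):
        raise ValueError("initial and probed must have the same length")
    if not initial:
        return 0.0
    hits = sum(
        temporal_iou(a, b) >= threshold for a, b in zip(initial, probed)
    )
    return hits / len(initial)


if __name__ == "__main__":
    first_pass = [(0.0, 10.0), (20.0, 30.0)]
    after_probe = [(1.0, 10.0), (40.0, 50.0)]
    print(consistency_rate(first_pass, after_probe))
```

In this toy input, the first pair overlaps heavily and the second not at all, so half of the predictions count as consistent under the assumed threshold.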

Code Implementations: 1 repo