CV · CY · May 14

ViMU: Benchmarking Video Metaphorical Understanding

arXiv:2605.14607 · 86.5
Predicted impact: top 20% in CV · last 90 days
Originality: Highly original
AI Analysis

This benchmark addresses a critical gap in AI video understanding: the social and cultural subtext of videos, which matters for applications such as content moderation and human-computer interaction.

ViMU introduces the first benchmark for evaluating video understanding models' ability to infer implicit meanings (e.g., metaphor, irony) beyond literal content, using hint-free questions. Results show current models significantly lag behind human performance, with the best model achieving only 56.2% accuracy on multiple-choice questions.

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, that is, the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate how well frontier models understand subtext in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
