CV · Nov 25, 2025

VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

arXiv:2511.20272v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the challenge of developing more generalizable MLLMs that can understand visual knowledge, which is crucial for applications in AI and robotics, though it is incremental as it builds on existing MLLM frameworks.

The paper tackles the problem of evaluating and improving visual knowledge understanding in multimodal large language models (MLLMs), which often lack human-like understanding of physical and social principles. It introduces the VKnowU benchmark, showing that leading models fall short of human performance, and proposes VideoKnow+, a baseline model that achieves a +3.7% improvement on VKnowU and gains on other benchmarks.

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) categories. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric types. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
