CL · Jul 22, 2025

Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models

arXiv:2507.16572v1 · 1 citation · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses unreliable physical reasoning in AI systems, a concern for researchers and developers, and offers an incremental insight into why models fail.

The paper evaluates multimodal large language models on intuitive physics tasks and finds that even state-of-the-art models struggle to distinguish plausible from implausible scenarios. Its probing analysis points to a vision-language misalignment, rather than the vision encoder, as the primary limitation.

This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.
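
The probing analysis described in the abstract can be illustrated with a small sketch. It is not the authors' code: it assumes a linear probe (logistic regression) trained on embeddings from each processing stage, and it substitutes synthetic vectors for real model embeddings. The stage names, the `signal` values, and every function name below are hypothetical, chosen only to show what a vision-language gap would look like in probe accuracy. The printed numbers say nothing about the actual models.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(plausible)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # gradient step with L2 penalty
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of held-out examples the probe classifies correctly."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def make_synthetic_stage(n, d, signal, rng):
    """Hypothetical stand-in for stage embeddings (ASSUMPTION, not real data).

    The plausibility label shifts the first coordinate by +/- signal,
    so signal=0 means the label is not linearly decodable at all.
    """
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += signal * (2 * y - 1)
    return X, y.astype(float)


def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    # Assumed stages and signal strengths, picked to mimic the reported pattern.
    for stage, signal in [("vision_encoder", 2.0), ("language_model", 0.0)]:
        X, y = make_synthetic_stage(400, 16, signal, rng)
        w, b = train_linear_probe(X[:300], y[:300])      # train split
        results[stage] = probe_accuracy(X[300:], y[300:], w, b)  # test split
    return results


if __name__ == "__main__":
    for stage, acc in run_demo().items():
        print(f"{stage}: probe accuracy = {acc:.2f}")
```

In a real study the synthetic arrays would be replaced by embeddings extracted from the model at each stage. A probe that succeeds on vision-encoder outputs but falls toward chance on later representations would be consistent with the misalignment the abstract describes.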

