AI · May 11

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

Dominik Helfenstein, Marco Menner, Maximilian Triebel

arXiv:2605.11223 · 20.0

Predicted impact: top 29% in AI · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers evaluating VLM capabilities in interactive environments, this benchmark highlights a critical gap between reasoning and execution in complex physical reasoning tasks.

The paper introduces VLATIM, a benchmark for evaluating human-like logical problem-solving in point-and-click puzzle games. Results show that large proprietary models have superior planning but struggle with precise visual grounding, failing to achieve human-like performance.

Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.
