SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

arXiv:2605.1470445.7

AI Analysis

For VLM researchers, this benchmark exposes the limitation of current models in inferring occluded object locations from task context and commonsense, motivating future work on integrated reasoning.

SceneFunRI introduces a benchmark for reasoning about invisible functional object locations in 3D scenes, showing that current VLMs (e.g., Gemini 3 Flash) achieve only 15.20% CAcc@75, highlighting a major gap in invisible-region reasoning.

In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.

View on arXiv PDF

Similar