SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
For VLM researchers, this benchmark exposes the limitations of current models in inferring the locations of occluded objects from task context and commonsense knowledge, motivating future work on integrated reasoning.
SceneFunRI is a benchmark for reasoning about the locations of invisible functional objects in 3D scenes. Even the strongest baseline VLM (Gemini 3 Flash) achieves only 15.20% CAcc@75, highlighting a major gap in invisible-region reasoning.
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Built on the SceneFun3D dataset through a semi-automatic pipeline, SceneFunRI formulates the task as a 2D spatial reasoning problem and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). Overall, our findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
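The abstract reports mIoU and Dist without defining them, so the following is only a minimal sketch of how such 2D localization scores are commonly computed. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and takes Dist to be the Euclidean distance between box centers. The function names are hypothetical, and the box format, the units, and the reading of Dist are assumptions rather than the benchmark's actual evaluation protocol. CAcc@75 is not sketched because its definition is not given here.

```python
from math import hypot


def box_area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred, gt):
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(gt) - inter
    return inter / union if union > 0 else 0.0


def center_distance(pred, gt):
    """Euclidean distance between box centers (assumed reading of Dist)."""
    pred_cx, pred_cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gt_cx, gt_cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    return hypot(pred_cx - gt_cx, pred_cy - gt_cy)


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs; 0.0 if empty."""
    scores = [iou(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: one partially overlapping prediction, one miss.
pairs = [
    ((0, 0, 2, 2), (1, 1, 3, 3)),
    ((0, 0, 1, 1), (5, 5, 6, 6)),
]
print(mean_iou(pairs))
print([center_distance(pred, gt) for pred, gt in pairs])
```

A prediction that misses the hidden target entirely scores an IoU of zero no matter how near it lands, whereas a center distance still separates near misses from far ones. This may be why an overlap score is reported together with a distance score.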