CV, Nov 22, 2025

InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

arXiv:2511.18200v2, 4 citations
Originality: Incremental advance
AI Analysis

InfiniBench addresses the limited, non-customizable evaluation of spatial reasoning in vision-language models, which makes failure modes easier to analyze. The advance is incremental because it builds on existing procedural and LLM-based methods.

The paper tackles the lack of customizable benchmarks for evaluating visual spatial reasoning in vision-language models by introducing InfiniBench, a benchmark generator that synthesizes a theoretically infinite variety of 3D scenes with parameterized complexity. It reports outperforming state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios.

Modern vision-language models (VLMs) are expected to reason spatially across scenes of diverse complexity, but evaluating this ability is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited control over scene complexity and cannot isolate and analyze specific VLM failure modes under distinct spatial conditions. To address this gap, rather than presenting separate benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable, and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control over scene complexity. InfiniBench uniquely translates natural-language scene descriptions into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks, including measurement, perspective-taking, and spatiotemporal tracking.
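
The abstract does not say how the camera trajectory is optimized. One simple way to think about the "full object coverage" requirement is as a set-cover problem: choose viewpoints until every object has been seen at least once. The sketch below is a generic greedy set cover, not the paper's method. The function name `greedy_view_cover`, the viewpoint names, and the visibility table are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set


def greedy_view_cover(visibility: Dict[str, Set[str]], objects: Set[str]) -> List[str]:
    """Pick viewpoints until every object is seen at least once (greedy set cover)."""
    uncovered = set(objects)
    chosen: List[str] = []
    while uncovered:
        # Choose the viewpoint that reveals the most still-unseen objects.
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(visibility), key=lambda v: len(visibility[v] & uncovered))
        gain = visibility[best] & uncovered
        if not gain:
            raise ValueError(f"objects not visible from any viewpoint: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical visibility table: viewpoint name -> objects visible from it.
visibility = {
    "door": {"sofa", "table"},
    "window": {"table", "lamp", "shelf"},
    "corner": {"sofa", "shelf"},
}
views = greedy_view_cover(visibility, {"sofa", "table", "lamp", "shelf"})
print(views)
```

Greedy selection is not guaranteed to find the smallest set of viewpoints, and a real trajectory optimizer would also need to order the views and account for occlusion and camera motion, which this sketch ignores.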

