Spatial Competence Benchmark
This work addresses the need for more comprehensive spatial evaluation benchmarks for AI models. The contribution is incremental: it builds on existing probing methods by adding hierarchical tasks and verifiers.
The authors evaluate spatial competence in large models by introducing the Spatial Competence Benchmark (SCBench), which spans three hierarchical capability buckets of tasks with executable outputs. They find that the accuracy of three frontier models decreases as task complexity increases, and that raising the output-token cap yields gains mainly at low budgets before saturating quickly.
Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.
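To make the idea of an executable output checked by a deterministic verifier concrete, the following is a minimal sketch, not SCBench's actual code. The grid-navigation task, the `verify_path` function, the move alphabet, and the `GRID` layout are all hypothetical and chosen only for illustration. The checker replays a model's proposed move string and rejects it if any step leaves the grid, enters an obstacle, or fails to end at the goal.

```python
from typing import List, Tuple

Cell = Tuple[int, int]

# Hypothetical move alphabet: each letter maps to a (row, column) offset.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def verify_path(grid: List[str], start: Cell, goal: Cell, moves: str) -> Tuple[bool, str]:
    """Deterministically replay `moves` from `start` and check global validity.

    `grid` is a list of equal-length strings where '#' marks an obstacle.
    Returns (passed, reason).
    """
    rows, cols = len(grid), len(grid[0])
    r, c = start
    for i, m in enumerate(moves):
        if m not in MOVES:
            return False, f"step {i}: unknown move {m!r}"
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return False, f"step {i}: left the grid"
        if grid[r][c] == "#":
            return False, f"step {i}: hit an obstacle"
    if (r, c) != goal:
        return False, "path does not end at the goal"
    return True, "ok"


# Illustrative 3x3 layout (an assumption, not taken from the benchmark).
GRID = [
    "..#",
    "..#",
    "...",
]

if __name__ == "__main__":
    print(verify_path(GRID, (0, 0), (2, 2), "DDRR"))
    print(verify_path(GRID, (0, 0), (2, 2), "RRDD"))
```

In this sketch, the second candidate consists of individually legal unit moves yet violates the obstacle constraint, which is a toy analogue of the failure mode described above: locally plausible output that breaks a global constraint. Because the checker has no randomness and no learned component, the same output always receives the same verdict.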