AI, May 9

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

arXiv:2605.08747
AI Analysis

VIGIL is an evaluation protocol for embodied AI that measures terminal commitment separately from task completion, two capabilities that standard benchmarks conflate, and exposes systematic commitment failures in current agents.

Standard embodied evaluations conflate task completion with correct termination. VIGIL decouples world-state completion (W) from benchmark success (B), which additionally requires a correct terminal report. Models with similar W differ by up to 19.7 pp in B, and action feedback improves W but leaves commitment failures in place for models that do not already ground their reports in the achieved state.

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
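The split between W and B can be illustrated with a minimal sketch. It is not VIGIL's implementation: the `Episode` fields, the reduction of the semantic report to a single boolean, and the mapping from those two booleans to the four outcome categories are all assumptions made for illustration, and the toy episodes are invented rather than taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MISSED_EXECUTION = "missed execution"
    POST_ATTAINMENT_DRIFT = "post-attainment drift"
    UNSUPPORTED_COMMITMENT = "unsupported commitment"
    VERIFIED_SUCCESS = "verified success"


@dataclass(frozen=True)
class Episode:
    # Hypothetical fields: the real protocol checks a semantic report
    # against hidden world state; here both are reduced to booleans.
    goal_attained: bool     # hidden world state satisfies the goal (counts toward W)
    reported_success: bool  # agent closed the episode with a success report


def classify(ep: Episode) -> Outcome:
    """Map one episode to an outcome category (assumed mapping)."""
    if ep.goal_attained and ep.reported_success:
        return Outcome.VERIFIED_SUCCESS
    if ep.goal_attained:
        return Outcome.POST_ATTAINMENT_DRIFT   # done, but never closed
    if ep.reported_success:
        return Outcome.UNSUPPORTED_COMMITMENT  # closed, but not done
    return Outcome.MISSED_EXECUTION


def scores(episodes: list[Episode]) -> tuple[float, float]:
    """Return (W, B) as fractions of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    w = sum(ep.goal_attained for ep in episodes) / n
    b = sum(classify(ep) is Outcome.VERIFIED_SUCCESS for ep in episodes) / n
    return w, b


# Invented toy data: two models with identical W but different B.
model_a = [Episode(True, True)] * 3 + [Episode(False, False)]
model_b = [Episode(True, True)] + [Episode(True, False)] * 2 + [Episode(False, False)]

for name, eps in (("A", model_a), ("B", model_b)):
    w, b = scores(eps)
    print(f"model {name}: W={w:.2f} B={b:.2f}")
```

In this toy setup both models reach the goal in three of four episodes, so W is identical, yet model B closes only one of them correctly. A single pass/fail score would hide exactly this kind of gap, which is the effect the abstract reports at up to 19.7 pp.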

