AI · May 11

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

arXiv:2605.09860 · 95.9
Predicted impact: top 9% in AI · last 90 days · Originality: Highly original
AI Analysis

For long-horizon reasoning in vision-language models, this work addresses the fixed-commitment bottleneck by enabling adaptive replanning, yielding substantial gains over fixed baselines.

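The paper introduces a method for learning state-conditioned commitment depth in vision-language policies. On Sliding Puzzle and Sokoban, it achieves up to 12.5 percentage points higher solve rate with approximately 25% fewer primitive actions per episode than fixed-depth baselines, and its 7B model outperforms GPT-5.5 and Claude Sonnet on both tasks.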

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.
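
The dominance claim at the end of the abstract can be illustrated with a small numeric sketch. The abstract does not define the surrogate, so the cost function below is an assumption made only for illustration: a hypothetical replanning cost amortized over the committed actions, plus a hypothetical per-state error penalty that grows with depth. The names `REPLAN_COST`, `surrogate_cost`, and the error rates are invented for this example and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical cost of a single replanning call (assumption, not from the paper).
REPLAN_COST = 1.0


def surrogate_cost(error_rate: float, depth: int) -> float:
    """Toy per-action cost at one state for a given commitment depth.

    The first term amortizes the replanning cost over `depth` open-loop actions;
    the second term penalizes compounding execution error, which grows with depth.
    """
    return REPLAN_COST / depth + error_rate * depth


def best_fixed_depth(error_rates: Sequence[float], depths: Sequence[int]) -> int:
    """Single depth that minimizes the total cost summed over all states."""
    return min(depths, key=lambda k: sum(surrogate_cost(e, k) for e in error_rates))


def adaptive_depths(error_rates: Sequence[float], depths: Sequence[int]) -> List[int]:
    """Per-state depth that minimizes the cost at each state independently."""
    return [min(depths, key=lambda k: surrogate_cost(e, k)) for e in error_rates]


# Three hypothetical states: easy (low error), medium, and hard (high error).
error_rates = [0.01, 0.25, 1.0]
depths = range(1, 11)

k_fixed = best_fixed_depth(error_rates, depths)
fixed_total = sum(surrogate_cost(e, k_fixed) for e in error_rates)

k_adaptive = adaptive_depths(error_rates, depths)
adaptive_total = sum(surrogate_cost(e, k) for e, k in zip(error_rates, k_adaptive))

print("best fixed depth:", k_fixed, "total cost:", round(fixed_total, 3))
print("adaptive depths:", k_adaptive, "total cost:", round(adaptive_total, 3))
```

Under this toy cost, the per-state minimum can never exceed the cost of any single shared depth, and it is strictly lower here because the three states prefer different depths. This mirrors the condition stated in the abstract but says nothing about the paper's actual surrogate or proof.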

