LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
For robotic manipulation researchers, LongBench provides a mechanism-aware benchmark that disentangles the sources of temporal difficulty in real-world long-horizon tasks, which simulation-based or aggregate-success benchmarks leave unclear.
LongBench is a real-world benchmark of over 1,000 episodes for evaluating long-horizon robotic manipulation. Its results indicate that performance degrades for distinct reasons: execution robustness matters most in fully observable tasks, while contextual difficulty dominates in ambiguity-driven tasks. Memory-based methods do not consistently help with the latter.
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor. We observe that performance in fully observable settings is more strongly associated with execution robustness, while contextual difficulty varies across tasks and is not consistently improved by memory-based methods. We hope that LongBench serves as a useful benchmark for studying long-horizon manipulation and for developing policies with stronger robustness across both execution and contextual challenges.
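To illustrate the difference between aggregate success and the per-regime, per-subset reporting described above, the following is a minimal sketch. The episode tuple layout, the subset names, and the function `success_by_subset` are hypothetical and are not taken from the LongBench release; the episode outcomes are toy values, not benchmark results.

```python
from collections import defaultdict


def success_by_subset(episodes):
    """Compute success rates at three levels of granularity.

    `episodes` is an iterable of (regime, subset, success) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, episode count]
    for regime, subset, success in episodes:
        # Count each episode toward its subset, its regime, and the overall total.
        for key in ((regime, subset), (regime, "ALL"), ("ALL", "ALL")):
            totals[key][0] += int(success)
            totals[key][1] += 1
    return {key: wins / count for key, (wins, count) in totals.items()}


# Toy outcomes with placeholder subset names (not real LongBench data).
episodes = [
    ("context_independent", "subset_a", True),
    ("context_independent", "subset_a", False),
    ("context_independent", "subset_b", True),
    ("context_dependent", "subset_c", False),
    ("context_dependent", "subset_c", False),
    ("context_dependent", "subset_d", True),
]

rates = success_by_subset(episodes)
for key in sorted(rates):
    print(key, round(rates[key], 3))
```

In this toy data the single aggregate number hides a gap between the two regimes, which is the kind of distinction a per-subset breakdown is meant to surface.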