When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
For embodied AI researchers: this benchmark and agent target long-horizon planning in household tasks, a problem that existing benchmarks largely overlook, and the results show that current methods are far from solving it.
The paper introduces LongAct, a benchmark for long-horizon household tasks specified through free-form instructions, and HoloMind, a VLM-driven agent that substantially improves long-horizon performance. Even the top models achieve only 59% goal completion and 16% full-task success, which underscores how difficult the benchmark remains.
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
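As a rough illustration of two ideas in the abstract, the sketch below orders subgoals from a dependency DAG and shows how goal completion can sit well above full-task success. It is not HoloMind's planner or LongAct's official scoring: the subgoal names, the episode data, and both metric definitions (goal completion as the mean fraction of subgoals achieved per episode, full-task success as the fraction of episodes with every subgoal achieved) are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical subgoal DAG for a "make tea" instruction: each key maps to
# the set of subgoals that must be finished before it can start.
subgoals = {
    "find_mug": set(),
    "wash_mug": {"find_mug"},
    "boil_water": set(),
    "make_tea": {"wash_mug", "boil_water"},
}


def execution_order(dag):
    """Return one valid order in which every subgoal follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the dependencies are cyclic.
    return list(TopologicalSorter(dag).static_order())


def goal_completion(episodes):
    """Mean fraction of subgoals achieved per episode (assumed definition)."""
    return sum(sum(ep) / len(ep) for ep in episodes) / len(episodes)


def full_task_success(episodes):
    """Fraction of episodes in which every subgoal was achieved (assumed definition)."""
    return sum(all(ep) for ep in episodes) / len(episodes)


# Made-up outcomes: one list of per-subgoal results per episode.
episodes = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]

if __name__ == "__main__":
    print(execution_order(subgoals))
    print(goal_completion(episodes), full_task_success(episodes))
```

Under these assumed definitions, a single missed subgoal fails the whole episode for full-task success while barely moving goal completion, which is one plausible reading of why the two reported numbers can diverge as much as 59% and 16%.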