RO · AI · Oct 31, 2024

PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

arXiv:2411.00081v1 · 67 citations · h-index: 45 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for better planning and reasoning in embodied multi-agent systems, particularly in household robotics. Its contribution is incremental: it builds on existing LLM and benchmark methods.

The authors tackle human-robot coordination in household activities by creating PARTNR, a benchmark of 100,000 natural language tasks. They find that state-of-the-art LLMs paired with humans require 1.5x as many steps as two humans collaborating, and that fine-tuned smaller LLMs can match models 9 times larger while being 8.6x faster at inference.

We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We employ a semi-automated task generation pipeline using Large Language Models (LLMs), incorporating simulation in the loop for grounding and verification. PARTNR stands as the largest benchmark of its kind, comprising 100,000 natural language tasks, spanning 60 houses and 5,819 unique objects. We analyze state-of-the-art LLMs on PARTNR tasks, across the axes of planning, perception and skill execution. The analysis reveals significant limitations in SoTA models, such as poor coordination and failures in task tracking and recovery from errors. When LLMs are paired with real humans, they require 1.5x as many steps as two humans collaborating and 1.1x more steps than a single human, underscoring the potential for improvement in these models. We further show that fine-tuning smaller LLMs with planning data can achieve performance on par with models 9 times larger, while being 8.6x faster at inference. Overall, PARTNR highlights significant challenges facing collaborative embodied agents and aims to drive research in this direction.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
