AIRO · Aug 10, 2024

Multi-Agent Planning Using Visual Language Models

arXiv:2408.05478v2 · 11 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of integrating planning and perception in free-form domains for AI systems, though it appears incremental because it builds on existing multi-agent and VLM approaches.

The paper tackles the difficulty LLMs and VLMs have with multi-modal planning by proposing a multi-agent architecture for embodied task planning that takes a single image as input and needs no specialized data structures. The approach is validated on the ALFRED dataset using a new evaluation procedure, PG2S.

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and applications across various domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For instance, when planning and perception are needed simultaneously, these models often struggle because of difficulties in merging multi-modal information. To address this issue, fine-tuned models are typically employed and trained on specialized data structures representing the environment. This approach has limited effectiveness, as it can overly complicate the context for processing. In this paper, we propose a multi-agent architecture for embodied task planning that operates without the need for specific data structures as input. Instead, it uses a single image of the environment, handling free-form domains by leveraging commonsense knowledge. We also introduce a novel, fully automatic evaluation procedure, PG2S, designed to better assess the quality of a plan. We validated our approach using the widely recognized ALFRED dataset, comparing PG2S to the existing KAS metric to further evaluate the quality of the generated plans.

