AI · CV · Apr 6, 2024

Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

arXiv:2404.04619v1 · 24 citations · h-index: 21
Originality: Incremental advance
AI Analysis

The paper addresses inefficiency and inflexibility in embodied AI systems for robotics and simulation applications, though it appears incremental because it builds on existing multi-modal language models.

The paper tackles the problem of complex agent systems for open-ended embodied tasks by proposing STEVE-2, a hierarchical knowledge distillation framework that distills agents into a single model, resulting in performance improvements of 1.4x to 7.3x on navigation and creation tasks.

With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Multi-modal Language Models (MLMs) now integrate multi-modal signals into LLMs, bringing richer perception to embodied agents and allowing them to handle world-understanding tasks at a finer level of detail. However, existing works: 1) operate as independent agents, each containing multiple LLMs spanning perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, limiting application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4\times$ to $7.3\times$ improvements in performance.

