CV · Oct 20, 2023

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

arXiv:2310.13255v2 · 57 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This work aims to make interactions with LLM-based embodied agents in open-world environments more intuitive and easier to understand. It is an incremental advance in multimodal AI for robotics.

The paper tackles the lack of visual perception in LLM-based embodied agents operating in open worlds. It proposes Steve-Eye, an end-to-end trained large multimodal model that integrates a visual encoder with an LLM, and reports improved performance in extensive experiments on three new benchmarks.

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently encounter challenges in intuitively comprehending their surroundings and producing responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model designed to address this limitation. Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback. In addition, we use a semi-automatic strategy to collect an extensive dataset comprising 850K open-world instruction pairs, empowering our model to encompass three essential functions for an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Lastly, we develop three open-world evaluation benchmarks, then carry out extensive experiments from a wide range of perspectives to validate our model's capability to strategically act and plan. Codes and datasets will be released.
