A Roadmap for Embodied and Social Grounding in LLMs
This work identifies a foundational problem for embodied AI and robotics: how to integrate LLMs more effectively. Its contribution is conceptual and incremental, proposing a framework rather than reporting empirical results.
The paper identifies a gap in grounding Large Language Models (LLMs) for robotics, arguing that connecting LLMs to the world through multimodal inputs or robot bodies alone is insufficient for true language understanding. To address this, it proposes a roadmap based on an active bodily system, temporally structured experience, and social skills.
The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in robotics, offering unparalleled capabilities not only in communication but also in skills such as multimodal input handling, high-level reasoning, and plan generation. Grounding LLMs' knowledge in the empirical world has been considered a crucial pathway to exploiting the efficiency of LLMs in robotics. Nevertheless, connecting LLMs' representations to the external world through multimodal approaches or robots' bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three elements that an agent needs in order to grasp and experience the world. The roadmap for LLM grounding envisages an active bodily system as the reference point for experiencing the environment, a temporally structured experience for coherent, self-related interaction with the external world, and social skills for acquiring a commonly grounded, shared experience.