CV, Aug 6, 2024

Body of Her: A Preliminary Study on End-to-End Humanoid Agent

arXiv:2408.02879v1, 10 citations, h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the gap in realistic humanoid agents for interactive virtual interfaces, but it is a preliminary exploration with incremental contributions.

The authors address the problem of building a realistic interactive humanoid agent by proposing an end-to-end network that integrates audio and visual inputs and is extended from a pre-trained LLM. The model is built with approximately 200,000 hours of audio, around 130,000 hours of video, and about 20,000 alignment samples. It demonstrates capabilities, such as generalized object manipulation, that were difficult to achieve in prior systems.

An interactive virtual humanoid agent is a crucial interface with the physical world. A relatively complete humanoid agent first needs a face and body, then both verbal and non-verbal abilities (such as eye contact, facial expression, lip motion, gesture, and manipulation), and finally the capacity for real-time duplex communication, e.g., the ability to actively interrupt conversations. Most prior systems consider only a subset of these elements, leaving a gap to a realistic humanoid agent. In this work, we propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors, including speech and full-body movements for talking, responding, idling, and manipulation. The system is a multimodal model that integrates audio and visual inputs and is extended from a pre-trained large language model (LLM). We collect approximately 200,000 hours of audio, around 130,000 hours of video data, and about 20,000 alignment samples to build the model. The final model demonstrates capabilities that are difficult to achieve in previous systems, such as generalized object manipulation. This work is a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research on scaling up.

