CV · Nov 28, 2023

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

arXiv:2311.16468v1 · 67 citations · h-index: 4
Originality Highly original
AI Analysis

This work addresses the problem of siloed, task-specific models in human motion research for AI and robotics applications. It is a novel integration rather than an incremental improvement.

The paper tackles the fragmentation of human motion tasks by introducing AvatarGPT, an all-in-one framework that unifies motion understanding, planning, and generation on a shared large language model. It reports state-of-the-art results on low-level tasks and promising results on high-level tasks, and it enables unlimited long-motion synthesis through iterative traversal of the tasks.

Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not all) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, and generation, as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline is developed to generate natural language descriptions of human action sequences from in-the-wild videos. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach to unlimited long-motion synthesis by iteratively traversing the tasks within the closed loop.
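
The abstract says motion sequences are encoded as discrete tokens that extend the LLM vocabulary, but it does not say how. The following is a minimal sketch of one common way to do this, nearest-neighbour lookup in a codebook; it is an illustrative assumption, not the paper's implementation. The function names, the `<motion_i>` token format, and the toy 2-D codebook are all hypothetical.

```python
import math


def quantize(frames, codebook):
    """Map each motion feature vector to the index of its nearest codebook entry."""
    indices = []
    for frame in frames:
        distances = [math.dist(frame, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def extend_vocabulary(vocab, codebook_size):
    """Return a copy of the text vocabulary with one new symbol per codebook entry."""
    extended = dict(vocab)
    for i in range(codebook_size):
        # Hypothetical token naming; new ids are appended after the text ids.
        extended[f"<motion_{i}>"] = len(extended)
    return extended


def motion_to_token_ids(frames, codebook, vocab):
    """Convert motion frames to ids in the extended vocabulary."""
    return [vocab[f"<motion_{i}>"] for i in quantize(frames, codebook)]


# Toy data (assumed): a 3-entry codebook over 2-D features and a tiny text vocabulary.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
text_vocab = {"<pad>": 0, "walk": 1, "jump": 2}

vocab = extend_vocabulary(text_vocab, len(codebook))
frames = [(0.1, 0.0), (0.9, 0.1), (0.1, 0.8)]
token_ids = motion_to_token_ids(frames, codebook, vocab)
```

Once motion is expressed as ids in the same vocabulary as text, a single sequence model can in principle read and emit both, which is what lets language act as the shared interface between tasks.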

