SentiAvatar: Towards Expressive and Interactive Digital Humans

Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, Ruihua Song

arXiv:2604.02908 · 85.61 citations · h-index: 6

Predicted impact: top 27% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the problem of building realistic virtual characters for applications such as gaming and virtual assistants. It is an incremental advance that builds on existing motion generation methods.

The authors tackle expressive, interactive 3D digital humans with SentiAvatar, a framework that targets three problems: data scarcity, semantic-to-motion mapping, and motion-prosody synchronization. It reports state-of-the-art results on two benchmarks (R@1 43.64% on SuSuInterActs; FGD 4.941 on BEATv2) and generates 6 seconds of output in 0.3 seconds.

We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.
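To make the plan-then-infill idea concrete, here is a toy sketch in plain Python. It is not the authors' model, which the abstract describes only at a high level. It assumes the "plan" step can be stood in for by a lookup from sentence-level tags to sparse keyframe poses, and the "infill" step by interpolation whose progress follows cumulative audio energy. The names `plan`, `infill`, `pose_library`, and `energy` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pose = List[float]  # hypothetical: one pose as a flat list of joint values


def plan(tags: Sequence[Tuple[int, str]], pose_library: Dict[str, Pose]) -> Dict[int, Pose]:
    """Stand-in for sentence-level planning: map (frame, tag) pairs to sparse keyframes."""
    return {frame: list(pose_library[tag]) for frame, tag in tags}


def infill(keyframes: Dict[int, Pose], energy: List[float]) -> List[Pose]:
    """Stand-in for frame-level infilling between sparse keyframes.

    Progress between two keyframes follows the cumulative audio energy,
    so the pose changes faster where the speech is louder.
    """
    idx = sorted(keyframes)
    if not idx or idx[0] != 0 or idx[-1] != len(energy) - 1:
        raise ValueError("keyframes must cover the first and last frame")
    frames: List[Pose] = [list(keyframes[0])]
    for i0, i1 in zip(idx, idx[1:]):
        a, b = keyframes[i0], keyframes[i1]
        # Running sum of energy over the frames after i0, up to and including i1.
        cum: List[float] = []
        acc = 0.0
        for e in energy[i0 + 1 : i1 + 1]:
            acc += e
            cum.append(acc)
        total = acc
        for k, c in enumerate(cum, start=1):
            # Fall back to linear interpolation over silent segments.
            w = c / total if total > 0 else k / len(cum)
            frames.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return frames


# Hypothetical two-pose library and a five-frame energy envelope.
library = {"rest": [0.0], "wave": [4.0]}
keys = plan([(0, "rest"), (4, "wave")], library)
motion = infill(keys, energy=[0.0, 3.0, 1.0, 0.0, 0.0])
```

In this toy setup most of the movement lands on the loud frames right after the first keyframe, while uniform energy reduces to plain linear interpolation. A learned system would replace both functions with trained models; the sketch only shows why decoupling the two stages lets the semantic choice (which keyframes) and the timing (how fast to move between them) be controlled separately.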
