SD AI AS · Oct 13, 2025

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin

arXiv:2510.12834v1 · 4.0 · h-index: 30

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of generating synchronized speech and gestures for human-computer interaction. The advance appears incremental because it builds on existing multimodal synthesis methods.

The paper tackles the problem of synthesizing speech and co-speech gestures from text by introducing Gelina, a unified framework that generates both modalities jointly from interleaved token sequences. It reports competitive speech quality and improved gesture generation compared to unimodal baselines.

Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
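
The idea of an interleaved token sequence can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the interleaving pattern, so the fixed chunk sizes, the vocabulary offset, and the function names `interleave` and `split_modalities` are all assumptions made for illustration. The sketch alternates chunks of speech tokens and gesture tokens in a single sequence, shifts gesture IDs into their own range so the two vocabularies do not collide, and shows that the streams can be recovered for modality-specific decoding.

```python
from typing import List, Tuple


def interleave(speech: List[int], gesture: List[int],
               speech_chunk: int, gesture_chunk: int,
               gesture_offset: int) -> List[int]:
    """Merge two discrete token streams into one sequence in alternating chunks.

    Assumption: every speech token ID is smaller than gesture_offset, so
    shifted gesture IDs occupy a disjoint range of a shared vocabulary.
    """
    merged: List[int] = []
    s = g = 0
    while s < len(speech) or g < len(gesture):
        # Emit the next chunk of speech tokens unchanged.
        merged.extend(speech[s:s + speech_chunk])
        s += speech_chunk
        # Emit the next chunk of gesture tokens, shifted by the offset.
        merged.extend(t + gesture_offset for t in gesture[g:g + gesture_chunk])
        g += gesture_chunk
    return merged


def split_modalities(merged: List[int],
                     gesture_offset: int) -> Tuple[List[int], List[int]]:
    """Recover the two streams so each can go to its own decoder."""
    speech = [t for t in merged if t < gesture_offset]
    gesture = [t - gesture_offset for t in merged if t >= gesture_offset]
    return speech, gesture


# Hypothetical toy data: two speech tokens per gesture token.
speech_tokens = [1, 2, 3, 4]
gesture_tokens = [7, 8]
sequence = interleave(speech_tokens, gesture_tokens,
                      speech_chunk=2, gesture_chunk=1, gesture_offset=1000)
print(sequence)  # [1, 2, 1007, 3, 4, 1008]
```

In a setup like this, a single autoregressive model would predict `sequence` one token at a time, so each gesture token is conditioned on the speech tokens that precede it and vice versa. The chunk ratio would normally follow the frame rates of the two tokenizers, which are not stated here.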
