CV · Dec 15, 2025

MoLingo: Motion-Language Alignment for Text-to-Motion Generation

arXiv:2512.13840v1 · 2 citations · h-index: 27
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating realistic human motion from text descriptions, with applications such as animation and robotics. It represents an incremental improvement over existing methods.

The paper tackles text-to-motion generation by proposing MoLingo, which uses a semantically aligned latent space and cross-attention text conditioning to improve motion realism and text-motion alignment, achieving state-of-the-art results on standard metrics and in a user study.

We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text-motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
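The contrast the abstract draws between single-token conditioning and multi-token cross-attention can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean-pooling used for the single-token variant, the residual addition, and the absence of learned projections are all assumptions made here for illustration. It only shows the structural difference: in cross-attention each motion latent weights the text tokens individually, whereas single-token conditioning gives every latent the same pooled vector.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_token_conditioning(motion_latents, text_tokens):
    # Hypothetical single-token variant: mean-pool the text tokens into one
    # vector and add that same vector to every motion latent.
    dim = len(text_tokens[0])
    pooled = [sum(tok[d] for tok in text_tokens) / len(text_tokens)
              for d in range(dim)]
    return [[qi + pi for qi, pi in zip(q, pooled)] for q in motion_latents]

def cross_attention_conditioning(motion_latents, text_tokens):
    # Hypothetical multi-token variant: each motion latent acts as a query
    # over all text tokens (used as both keys and values; no learned
    # projections in this toy version), then adds the attended text back.
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in motion_latents:
        weights = softmax([dot(q, k) * scale for k in text_tokens])
        mixed = [sum(w * tok[d] for w, tok in zip(weights, text_tokens))
                 for d in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out

# Toy data: two motion latents and two text tokens, all 2-dimensional.
latents = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0]]

single = single_token_conditioning(latents, tokens)
cross = cross_attention_conditioning(latents, tokens)
```

In this toy setting, `single` shifts both latents by the same pooled vector, while `cross` pulls each latent toward the text token it matches best, which is the per-token selectivity that a single pooled vector cannot express.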
