CV · Apr 13

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

arXiv:2604.11737 · 78.9 · h-index: 15
Predicted impact: top 30% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the inefficiency of exploring multiple future motions via full video synthesis, offering a more efficient alternative for motion generation tasks.

The paper introduces a method for generating long-term motion embeddings with a 64x temporal compression factor, enabling efficient kinematics generation that outperforms state-of-the-art video models and task-specific approaches in motion realism and goal fulfillment.

Understanding and predicting motion is a fundamental component of visual intelligence. Although modern video models exhibit strong comprehension of scene dynamics, exploring multiple possible futures through full video synthesis remains prohibitively inefficient. We model scene dynamics orders of magnitude more efficiently by directly operating on a long-term motion embedding that is learned from large-scale trajectories obtained from tracker models. This enables efficient generation of long, realistic motions that fulfill goals specified via text prompts or spatial pokes. To achieve this, we first learn a highly compressed motion embedding with a temporal compression factor of 64x. In this space, we train a conditional flow-matching model to generate motion latents conditioned on task descriptions. The resulting motion distributions outperform those of both state-of-the-art video models and specialized task-specific approaches.
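
The two stages named in the abstract, a 64x temporally compressed motion latent and a conditional flow-matching model trained in that latent space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Only the 64x factor comes from the abstract; the straight-line (rectified-flow style) probability path, the Euler sampler, the latent shapes, and every name below (`latent_length`, `flow_matching_pair`, `flow_matching_loss`, `euler_sample`, `velocity_fn`) are assumptions chosen for illustration.

```python
import numpy as np

TEMPORAL_COMPRESSION = 64  # the only value taken from the abstract


def latent_length(num_steps: int, factor: int = TEMPORAL_COMPRESSION) -> int:
    """Latent time steps needed for a trajectory of num_steps (rounds up)."""
    return -(-num_steps // factor)


def _batch_column(batch: int, ndim: int) -> tuple:
    """Shape (batch, 1, ..., 1) so a per-sample time broadcasts over a latent."""
    return (batch,) + (1,) * (ndim - 1)


def flow_matching_pair(x1: np.ndarray, rng: np.random.Generator):
    """Draw (x_t, t, target velocity) on the straight path from noise to data.

    Assumed path: x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)                    # noise endpoint
    t = rng.uniform(size=_batch_column(x1.shape[0], x1.ndim))
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0


def flow_matching_loss(velocity_fn, x1, cond, rng) -> float:
    """Mean squared error between predicted and target velocity."""
    x_t, t, target = flow_matching_pair(x1, rng)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target) ** 2))


def euler_sample(velocity_fn, cond, shape, rng, num_steps: int = 32):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(_batch_column(shape[0], len(shape)), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x


# Toy check: if every data sample equals one fixed latent `goal`, the exact
# velocity field is (goal - x) / (1 - t), so no trained network is needed here.
def toy_velocity(x, t, cond):
    return (cond - x) / (1.0 - t)


rng = np.random.default_rng(0)
n_latent = latent_length(256)                  # 256 tracked steps -> 4 latent steps
goal = rng.standard_normal((1, n_latent, 8))   # hypothetical latent width of 8
data = np.tile(goal, (16, 1, 1))               # batch of 16 identical targets
loss = flow_matching_loss(toy_velocity, data, goal, rng)
samples = euler_sample(toy_velocity, goal, data.shape, rng)
```

In a real system `velocity_fn` would be a learned network conditioned on a text prompt or poke encoding; the closed-form `toy_velocity` only stands in for it so the sketch runs end to end. Sampling in the compressed space touches `latent_length(T)` time steps per trajectory rather than `T`, which is where the efficiency argument in the abstract comes from.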
