SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion
This work addresses a specific problem in human motion generation, with applications in computer graphics or robotics. Its contribution is incremental: it builds on existing text-to-motion models and adds scene awareness to them through adaptation.
The paper tackles the problem of generating human motion that is both text-conditioned and scene-aware, which is challenging due to the lack of large-scale datasets combining both aspects. It introduces SceneAdapt, a framework that uses motion inbetweening as a proxy task to bridge disjoint datasets, resulting in effective injection of scene awareness into text-to-motion models.
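To make the proxy task concrete, the following is a minimal sketch of how an inbetweening training example could be built from motion alone, with no text or scene labels. The function name `make_inbetweening_example`, the fixed-stride keyframe selection, and the zero-filling of missing frames are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Frame = List[float]  # one pose vector per frame (hypothetical representation)


def make_inbetweening_example(
    motion: List[Frame], stride: int
) -> Tuple[List[Frame], List[int]]:
    """Keep every `stride`-th frame (plus the last one) as a keyframe and
    zero out the rest; a model would be trained to restore the zeroed frames.

    Fixed-stride selection is an assumption made for this sketch only.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    n = len(motion)
    # mask[t] == 1 marks an observed keyframe, 0 marks a frame to be filled in.
    mask = [1 if (t % stride == 0 or t == n - 1) else 0 for t in range(n)]
    masked = [
        list(frame) if keep else [0.0] * len(frame)
        for frame, keep in zip(motion, mask)
    ]
    return masked, mask


# Toy 6-frame, 2-dimensional "motion"; no text or scene input is involved.
motion = [[float(t), float(t) * 0.5] for t in range(6)]
masked, mask = make_inbetweening_example(motion, stride=3)
print(mask)
```

Because the example needs only a motion sequence, the same construction applies to both a text-motion dataset and a scene-motion dataset, which is what allows the task to act as a bridge between them.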
Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene awareness in isolation, since constructing large-scale datasets with both rich text–motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene–motion and text–motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene awareness into text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.
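As an illustration of the second stage, the sketch below shows generic single-head cross-attention in which motion latents act as queries over a set of scene feature vectors, with the attended context added back as a residual. It is a simplified stand-in rather than the paper's layer: the learned query, key, and value projections are omitted, scene features serve as both keys and values, and the name `scene_cross_attention` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scene_cross_attention(latents: List[Vector], scene: List[Vector]) -> List[Vector]:
    """Each motion latent (query) attends over scene features (keys = values)
    and adds the attended context back as a residual.

    Assumption: no learned projections, single head, equal feature sizes.
    """
    d = len(scene[0])
    out = []
    for q in latents:
        # Scaled dot-product score between this latent and every scene feature.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in scene]
        weights = softmax(scores)
        # Weighted average of scene features: the "queried" scene context.
        context = [sum(w * v[j] for w, v in zip(weights, scene)) for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


# Toy input: one motion latent and two scene feature vectors.
latents = [[0.0, 0.0]]
scene = [[1.0, 0.0], [0.0, 1.0]]
conditioned = scene_cross_attention(latents, scene)
print(conditioned)
```

The residual form means the layer leaves the motion latent unchanged when the attended context is zero, which is one common way to add conditioning to a pretrained model without disturbing its existing behavior; whether SceneAdapt uses exactly this form is not stated in the abstract.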