CV · Dec 15, 2024

GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout

arXiv:2412.11198v130.472 citations · h-index: 11 · Has Code · CVPR

Originality: Incremental advance

AI Analysis

This work addresses the need for fine-grained control in ego-vision applications like autonomous driving and human activities, representing an incremental advancement with novel components such as autoregressive noise schedules.

The paper tackles controllable future-frame generation in ego-vision scenarios by introducing GEM, a multimodal world model that predicts RGB and depth outputs with control over object dynamics, ego-motion, and human poses. According to the reported experiments, it produces stable long-horizon generations and performs well across diverse, controllable scenarios.

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset comprises 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. We use pseudo-labels to obtain depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and at maintaining temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
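
The abstract names autoregressive noise schedules but does not describe them. One plausible reading, offered here purely as an assumption and not as the authors' method, is that each frame in the generation window gets its own noise level, with later frames kept noisier than earlier ones so that frames are resolved roughly in temporal order. The function name `per_frame_noise_levels` and the parameters `sigma_max` and `lag` are hypothetical and chosen only for illustration.

```python
def per_frame_noise_levels(num_frames, step, num_steps, sigma_max=1.0, lag=0.5):
    """Hypothetical per-frame noise schedule (an illustration, not GEM's actual schedule).

    At denoising progress t = step / num_steps, frame i receives a noise level
    that is never lower than that of any earlier frame. All frames start at
    sigma_max (t = 0) and all reach 0 (t = 1); in between, earlier frames are
    cleaner than later ones.
    """
    if num_frames < 1 or num_steps < 1:
        raise ValueError("num_frames and num_steps must be positive")
    t = step / num_steps                      # global progress in [0, 1]
    denom = max(num_frames - 1, 1)            # avoid division by zero for one frame
    levels = []
    for i in range(num_frames):
        p = i / denom                         # relative frame position in [0, 1]
        raw = 1.0 - t * (1.0 + lag) + lag * p  # later frames lag behind earlier ones
        levels.append(sigma_max * max(0.0, min(1.0, raw)))
    return levels


# Usage: noise levels for a 5-frame window halfway through denoising.
halfway = per_frame_noise_levels(num_frames=5, step=5, num_steps=10)
```

With `lag = 0`, every frame shares the same noise level, which recovers an ordinary schedule applied uniformly to the whole clip; larger values of `lag` widen the gap between the first and last frame.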
