RO CV LG Jan 3, 2025

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren

arXiv:2501.01895v331.455 citations h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses robotics manipulation for real-world applications, combining several incremental innovations into one comprehensive system.

The paper tackles robotics manipulation by introducing EnerVerse, a generative foundation model that constructs and interprets embodied spaces and translates 4D world representations into physical actions. It reports state-of-the-art performance in simulation and real-world tasks, with inference taking about 280 ms per 8-step action chunk on a single RTX 4090.

We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs a chunk-wise autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we adopt a multi-view video representation, providing rich perspectives to address challenges like motion ambiguity and 3D grounding. Additionally, EnerVerse-D, a data engine pipeline combining generative modeling with 4D Gaussian Splatting, forms a self-reinforcing data loop to reduce the sim-to-real gap. Leveraging these innovations, EnerVerse translates 4D world representations into physical actions via a policy head (EnerVerse-A), achieving state-of-the-art performance in both simulation and real-world tasks. For efficiency, EnerVerse-A reuses features from the first denoising step and predicts action chunks, achieving about 280 ms per 8-step action chunk on a single RTX 4090. Further video demos and dataset samples can be found on our project page.
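To put the reported efficiency figure in per-step terms, the short sketch below works through the arithmetic. Only the two inputs (about 280 ms and 8 steps per chunk) come from the abstract. The function names are illustrative, and the rate calculation assumes that chunks are produced back to back and that model inference is the only cost, which the abstract does not state.

```python
# Inputs taken from the abstract: about 280 ms per 8-step action chunk.
CHUNK_LATENCY_MS = 280.0
STEPS_PER_CHUNK = 8


def amortized_latency_ms(chunk_latency_ms: float, steps_per_chunk: int) -> float:
    """Inference time attributed to each action step in a chunk."""
    if steps_per_chunk <= 0:
        raise ValueError("steps_per_chunk must be positive")
    return chunk_latency_ms / steps_per_chunk


def max_action_rate_hz(chunk_latency_ms: float, steps_per_chunk: int) -> float:
    """Upper bound on actions per second.

    Assumption (not from the source): chunks are generated back to back
    and inference is the only cost in the control loop.
    """
    if chunk_latency_ms <= 0:
        raise ValueError("chunk_latency_ms must be positive")
    return steps_per_chunk * 1000.0 / chunk_latency_ms


per_step = amortized_latency_ms(CHUNK_LATENCY_MS, STEPS_PER_CHUNK)
rate = max_action_rate_hz(CHUNK_LATENCY_MS, STEPS_PER_CHUNK)

print(f"{per_step:.1f} ms per action step")  # 35.0 ms per action step
print(f"{rate:.1f} actions per second")      # 28.6 actions per second
```

Under these assumptions, predicting a chunk of 8 actions per forward pass amortizes the cost to roughly 35 ms per action, compared with 280 ms if each pass yielded a single action.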
