EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
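This work addresses robotics manipulation for real-world applications, combining several incremental innovations (chunk-wise autoregressive video diffusion, a sparse context memory, a multi-view video representation, the EnerVerse-D data engine, and the EnerVerse-A policy head) into one comprehensive system.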
The paper tackles the problem of robotics manipulation by introducing EnerVerse, a generative foundation model that constructs and interprets embodied spaces to translate 4D world representations into physical actions. It reports state-of-the-art performance in simulation and real-world tasks, with inference taking about 280 ms per 8-step action chunk on a single RTX 4090.
We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs a chunk-wise autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we adopt a multi-view video representation, providing rich perspectives to address challenges like motion ambiguity and 3D grounding. Additionally, EnerVerse-D, a data engine pipeline combining generative modeling with 4D Gaussian Splatting, forms a self-reinforcing data loop to reduce the sim-to-real gap. Leveraging these innovations, EnerVerse translates 4D world representations into physical actions via a policy head (EnerVerse-A), achieving state-of-the-art performance in both simulation and real-world tasks. For efficiency, EnerVerse-A reuses features from the first denoising step and predicts action chunks, achieving about 280 ms per 8-step action chunk on a single RTX 4090. Further video demos and dataset samples can be found on our project page.
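The chunk-wise autoregressive loop with a sparse context memory can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sparse_context`, `rollout`, `toy_predict_chunk`, and `per_action_latency_ms` are hypothetical, frames are plain integers standing in for latent video frames, the diffusion model is replaced by a trivial callable, and the memory policy (a random subset of older frames plus the most recent frame) is an assumption, since the abstract does not specify how the memory is sampled.

```python
import random
from typing import Callable, List, Sequence

Frame = int  # stand-in for a latent video frame (assumption for illustration)


def sparse_context(history: Sequence[Frame], num_memory: int,
                   rng: random.Random) -> List[Frame]:
    """Return a sparse, temporally ordered subset of past frames.

    Assumed policy: always keep the most recent frame and fill the
    remaining slots with randomly chosen older frames.
    """
    if not history:
        return []
    latest = len(history) - 1
    older = list(range(latest))
    k = min(max(num_memory - 1, 0), len(older))
    picked = sorted(rng.sample(older, k)) + [latest]
    return [history[i] for i in picked]


def rollout(initial: Sequence[Frame],
            predict_chunk: Callable[[List[Frame]], List[Frame]],
            num_chunks: int, num_memory: int = 4, seed: int = 0) -> List[Frame]:
    """Generate frames chunk by chunk, conditioning each chunk on sparse memory."""
    rng = random.Random(seed)
    history = list(initial)
    for _ in range(num_chunks):
        context = sparse_context(history, num_memory, rng)
        history.extend(predict_chunk(context))  # append the new chunk
    return history


def toy_predict_chunk(context: List[Frame], chunk_size: int = 8) -> List[Frame]:
    """Placeholder for the video diffusion model: continue counting
    from the most recent context frame."""
    last = context[-1]
    return [last + i for i in range(1, chunk_size + 1)]


def per_action_latency_ms(chunk_latency_ms: float, actions_per_chunk: int) -> float:
    """Amortized latency per action when actions are predicted in chunks."""
    return chunk_latency_ms / actions_per_chunk


if __name__ == "__main__":
    frames = rollout([0], toy_predict_chunk, num_chunks=3)
    print(len(frames))                      # 1 initial frame + 3 chunks of 8
    print(per_action_latency_ms(280.0, 8))  # amortized cost of one action
```

Because the context size is capped at `num_memory`, the conditioning cost per chunk stays constant however long the rollout becomes, which is the usual motivation for a sparse memory. The helper at the end only restates the reported figure: about 280 ms for an 8-step chunk amortizes to roughly 35 ms per action.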