Vision-Centric 4D Occupancy Forecasting and Planning via Implicit Residual World Models
This work addresses a computational bottleneck in end-to-end autonomous driving systems, improving both efficiency and accuracy, though the contribution is incremental because it builds on existing world model approaches.
The paper tackles the inefficiency of full scene reconstruction in vision-centric world models for autonomous driving by proposing IR-WM, which models only the residual changes relative to the previous state. It achieves top performance on the nuScenes benchmark for 4D occupancy forecasting and trajectory planning.
End-to-end autonomous driving systems increasingly rely on vision-centric world models to understand and predict their environment. However, a common inefficiency in these models is the full reconstruction of future scenes, which expends significant capacity on redundantly modeling static backgrounds. To address this, we propose IR-WM, an Implicit Residual World Model that focuses on modeling the current state and evolution of the world. IR-WM first establishes a robust bird's-eye-view (BEV) representation of the current state from the visual observation. It then leverages the BEV features from the previous timestep as a strong temporal prior and predicts only the "residual", i.e., the changes conditioned on the ego-vehicle's actions and scene context. To alleviate error accumulation over time, we further apply an alignment module to calibrate semantic and dynamic misalignments. Moreover, we investigate different forecasting-planning coupling schemes and demonstrate that the implicit future state generated by world models substantially improves planning accuracy. On the nuScenes benchmark, IR-WM achieves top performance in both 4D occupancy forecasting and trajectory planning.
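
The residual idea described above can be illustrated with a toy rollout. This is a minimal sketch under stated assumptions, not the paper's implementation: the BEV state is flattened to a list of floats, `rollout_residual` and `toy_residual` are hypothetical names, the residual predictor is a hand-written stand-in for a learned network, and the alignment module is omitted entirely.

```python
from typing import Callable, List, Sequence

BEV = List[float]         # toy stand-in for a flattened BEV feature map
Action = Sequence[float]  # toy stand-in for an ego-vehicle action


def rollout_residual(
    bev_now: BEV,
    actions: Sequence[Action],
    predict_residual: Callable[[BEV, Action], BEV],
) -> List[BEV]:
    """Roll the state forward by adding a predicted residual at each step.

    The previous state acts as the temporal prior, so the predictor only
    has to describe what changes, not the whole scene.
    """
    states: List[BEV] = []
    prev = list(bev_now)
    for action in actions:
        delta = predict_residual(prev, action)
        # Next state = previous state + predicted change.
        prev = [p + d for p, d in zip(prev, delta)]
        states.append(prev)
    return states


def toy_residual(bev: BEV, action: Action) -> BEV:
    """Hypothetical predictor: only the last cell (a 'dynamic' region) changes."""
    delta = [0.0] * len(bev)
    delta[-1] = action[0]
    return delta


# Two future steps from a three-cell state; the first two cells stay static.
future = rollout_residual([1.0, 2.0, 3.0], [(0.5,), (0.25,)], toy_residual)
print(future)
```

In this toy setup the static cells are carried over unchanged at every step, which is the capacity saving the abstract attributes to predicting residuals rather than reconstructing the full scene. Because each step builds on the previous prediction, errors can accumulate, which is the problem the paper's alignment module is meant to alleviate.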