Video Occupancy Models
This work addresses the need for efficient video prediction models in robotics and autonomous systems, though it appears incremental because it builds on prior latent-space world models.
The paper tackles video prediction for control tasks by introducing Video Occupancy Models (VOCs). VOCs operate in a latent space and predict the discounted distribution of future states in a single step, which avoids both pixel-level prediction and multi-step rollouts. The authors report that both properties benefit downstream control.
We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy Models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multi-step rollouts. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at \href{https://github.com/manantomar/video-occupancy-models}{\texttt{github.com/manantomar/video-occupancy-models}}.
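To make the single-step prediction target concrete, the discounted distribution of future states can be written in the standard form of a discounted occupancy measure. The notation here is assumed for illustration and does not come from the text above: $z_t$ is the latent state at time $t$, $z^{+}$ is a candidate future latent state, $\gamma \in [0, 1)$ is a discount factor, and $p(z_{t+k} = z^{+} \mid z_t)$ is the $k$-step transition distribution under the policy that generated the video.

\[
\mu(z^{+} \mid z_t) \;=\; (1 - \gamma) \sum_{k=1}^{\infty} \gamma^{\,k-1} \, p(z_{t+k} = z^{+} \mid z_t)
\]

Under this form, drawing an offset $k$ from a geometric distribution with parameter $1 - \gamma$ and taking the state $k$ steps ahead yields a sample from $\mu$. A model fitted to $\mu$ can therefore be queried once, rather than unrolled one step at a time.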