Lifelong Learning of Video Diffusion Models From a Single Video Stream
This work addresses lifelong learning of video generation in a setting resembling that of embodied agents, showing that training from a single continuous video stream can be as effective as standard offline training.
The paper studies training autoregressive video diffusion models from a single continuous video stream. It shows that this streaming setup can match standard offline training given the same number of gradient steps, and that experience replay retaining only a subset of the preceding stream is enough to achieve this. To support training and evaluation in this lifelong learning setting, it introduces four new datasets, each with one million consecutive frames.
This work demonstrates that training autoregressive video diffusion models from a single video stream, resembling the experience of embodied agents, is not only possible, but can also be as effective as standard offline training given the same number of gradient steps. Our work further reveals that this main result can be achieved using experience replay methods that only retain a subset of the preceding video stream. To support training and evaluation in this setting, we introduce four new datasets for streaming lifelong generative video modeling: Lifelong Bouncing Balls, Lifelong 3D Maze, Lifelong Drive, and Lifelong PLAICraft, each consisting of one million consecutive frames from environments of increasing complexity.
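To make the idea of experience replay over a stream concrete, the following is a minimal sketch of a fixed-capacity replay buffer that retains only a subset of what has been seen. It is an illustration under stated assumptions, not the paper's method: the abstract does not say which replay strategy is used, so reservoir sampling is swapped in here as one common choice, and the names `ReservoirReplayBuffer`, `stream_training_batches`, and `replay_per_step` are hypothetical. Stream items stand in for video clips, and the diffusion model and its gradient step are omitted.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-capacity buffer holding a uniform random subset of the stream so far.

    Assumption: reservoir sampling is used for illustration only; the source
    does not specify the replay strategy.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        """Offer one new stream item to the buffer."""
        self.num_seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / num_seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.num_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        """Draw up to k distinct stored items uniformly at random."""
        return self.rng.sample(self.items, min(k, len(self.items)))


def stream_training_batches(stream, buffer, replay_per_step):
    """Yield one batch per incoming item: the new item plus replayed old items.

    Each yielded batch would correspond to one gradient step in a real
    training loop (hypothetical; the model update is not shown).
    """
    for item in stream:
        replayed = buffer.sample(replay_per_step)
        buffer.add(item)
        yield [item] + replayed


if __name__ == "__main__":
    buf = ReservoirReplayBuffer(capacity=5, seed=0)
    for step, batch in enumerate(stream_training_batches(range(20), buf, 3)):
        print(step, batch)
```

With this design, the number of gradient steps equals the number of stream items, while memory stays bounded by `capacity` regardless of stream length.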