CV · Dec 21, 2025

Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

arXiv:2512.18741v1 · 6 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the trade-off between memory efficiency and consistency in real-time video generation, which matters for applications such as interactive world models and game engines. It represents an incremental improvement over existing methods.

The paper tackles catastrophic forgetting and scene inconsistency in long video generation. It proposes the Memorize-and-Generate (MAG) framework, which decouples memory compression from frame generation, and reports superior historical scene consistency while maintaining competitive performance on standard benchmarks.

Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
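The trade-off described in the abstract can be illustrated with a toy cache simulation. This is a minimal sketch under stated assumptions, not the paper's method: the class names `WindowCache` and `CompressedCache`, the `mean_pool` compressor, and every size parameter are hypothetical, and mean pooling merely stands in for the learned memory model, whose architecture the abstract does not specify.

```python
from collections import deque


def mean_pool(tokens, out_len):
    """Toy compressor: average contiguous chunks of one frame's tokens.

    Hypothetical stand-in for a learned memory model (an assumption).
    """
    if out_len <= 0 or len(tokens) % out_len != 0:
        raise ValueError("len(tokens) must be a positive multiple of out_len")
    chunk = len(tokens) // out_len
    return [sum(tokens[i * chunk:(i + 1) * chunk]) / chunk for i in range(out_len)]


class WindowCache:
    """Window attention: frames older than `window` are dropped entirely."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)

    def append(self, frame_id, tokens):
        self.frames.append((frame_id, tokens))

    def visible_frames(self):
        return [fid for fid, _ in self.frames]

    def size(self):
        return sum(len(t) for _, t in self.frames)


class CompressedCache:
    """Recent frames at full size; older frames kept as compact summaries."""

    def __init__(self, window, memory_tokens):
        self.window = window
        self.memory_tokens = memory_tokens
        self.recent = deque()
        self.memory = []

    def append(self, frame_id, tokens):
        self.recent.append((frame_id, tokens))
        if len(self.recent) > self.window:
            # Compress the frame leaving the window instead of discarding it.
            old_id, old_tokens = self.recent.popleft()
            self.memory.append((old_id, mean_pool(old_tokens, self.memory_tokens)))

    def visible_frames(self):
        return [fid for fid, _ in self.memory] + [fid for fid, _ in self.recent]

    def size(self):
        return sum(len(t) for _, t in self.memory) + sum(len(t) for _, t in self.recent)


def simulate(num_frames=20, tokens_per_frame=16, window=4, memory_tokens=2):
    """Feed the same frames to both caches and report cache entries kept."""
    win = WindowCache(window)
    comp = CompressedCache(window, memory_tokens)
    for f in range(num_frames):
        tokens = [float(f)] * tokens_per_frame  # placeholder token values
        win.append(f, tokens)
        comp.append(f, tokens)
    return {
        "full_history": num_frames * tokens_per_frame,
        "window": win.size(),
        "compressed": comp.size(),
        "window_sees_frame_0": 0 in win.visible_frames(),
        "compressed_sees_frame_0": 0 in comp.visible_frames(),
    }


if __name__ == "__main__":
    print(simulate())
```

With the default parameters, full history keeps 320 entries, the window keeps 64 but no longer sees frame 0, and the compressed cache keeps 96 while still holding a summary of frame 0. In this toy the compressed memory still grows with video length, only at a smaller rate; how MAG bounds or structures its compact KV cache is not stated in the abstract.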

