CV · May 25

Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution

arXiv:2605.25333 · 99.0 · Has Code
Predicted impact: top 2% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses state persistence in video world models. It offers the video generation community a practical training recipe, though the advance is incremental.

ReMind is a framework that trains video diffusion transformers to use their KV-cache as dynamic memory, so that hidden states keep evolving across interruptions. It achieves the best overall scores on STEVO-Bench and recovery tasks without catastrophic forgetting.

Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
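The frame-graph idea in the abstract (protected anchors, degraded intervals, explicit temporal gaps) can be illustrated with a small sketch. This is not the authors' implementation: the names `FrameNode` and `build_frame_graph`, the half-open span arguments, and the use of a single drop span and a single degrade span are all assumptions made for illustration. The sketch only shows how dropping an interval while protecting anchors leaves explicit gaps that a model would have to bridge by retrieving earlier states.

```python
from dataclasses import dataclass


@dataclass
class FrameNode:
    index: int  # frame index in the original clip
    role: str   # "anchor", "degraded", or "clean"
    gap: int    # frames elapsed since the previous kept node (0 for the first)


def build_frame_graph(num_frames, anchors, drop_span, degrade_span):
    """Hypothetical helper: turn a clip into a list of kept frame nodes.

    anchors      -- frame indices that are never dropped or degraded
    drop_span    -- half-open (start, end) interval of frames to remove
    degrade_span -- half-open (start, end) interval of frames marked degraded
    """
    anchors = set(anchors)
    nodes = []
    prev = None
    for i in range(num_frames):
        if i in anchors:
            role = "anchor"            # protected: survives any span
        elif drop_span[0] <= i < drop_span[1]:
            continue                   # node-drop: frame is removed entirely
        elif degrade_span[0] <= i < degrade_span[1]:
            role = "degraded"          # kept, but flagged as unreliable evidence
        else:
            role = "clean"
        gap = 0 if prev is None else i - prev
        nodes.append(FrameNode(i, role, gap))
        prev = i
    return nodes


graph = build_frame_graph(12, anchors=[0, 6], drop_span=(4, 9), degrade_span=(9, 11))
for node in graph:
    print(node)
```

In this example, frame 6 lies inside the drop span but survives because it is an anchor, and the nodes on either side of the dropped frames carry a gap larger than 1, which makes the interruption explicit rather than silently re-indexing the sequence.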

