OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
This work provides a more scalable and adaptive memory retrieval framework for researchers and developers working on long video generation, addressing the challenge of maintaining explicit access to historical details without excessive memory cost.
This paper tackles the problem of scaling autoregressive video generation to long videos, which requires repeated access to a growing historical KV cache. The proposed method, OmniMem, performs sparse KV retrieval over the full historical cache. Relative to strong baselines, it improves Dynamic Degree by 52.3% while preserving strong consistency and keeping memory usage comparable.
Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer, so that each attention head can retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that, relative to strong baselines, OmniMem improves Dynamic Degree by 52.3% while preserving strong consistency and maintaining comparable memory usage.
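The selection logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-pooling of query scores, the top-k rule, and the "enough long-range blocks to fill the budget" threshold are all assumptions made for illustration, since the abstract does not specify them. Per-head scattered access is only mimicked by keeping one index list per head instead of merging them into one buffer.

```python
from typing import List


def select_blocks(scores: List[float], local_window: int, budget: int) -> List[int]:
    """Pick up to `budget` KV block indices from one score vector.

    Assumed form of Adaptive Window Exclusion: the most recent `local_window`
    blocks are dropped from the candidates only when the older (long-range)
    blocks alone can fill the budget.
    """
    num_blocks = len(scores)
    first_local = max(0, num_blocks - local_window)
    long_range = list(range(first_local))
    if len(long_range) >= budget:
        candidates = long_range              # enough history: exclude local window
    else:
        candidates = list(range(num_blocks))  # short history: keep everything
    top = sorted(candidates, key=lambda b: scores[b], reverse=True)[:budget]
    return sorted(top)


def shared_block_scores(per_query_scores: List[List[float]]) -> List[float]:
    """Assumed form of Query-Shared KV Selection: mean-pool block scores
    over all queries in a chunk so they share one selection."""
    num_queries = len(per_query_scores)
    num_blocks = len(per_query_scores[0])
    return [sum(q[b] for q in per_query_scores) / num_queries
            for b in range(num_blocks)]


def union_of_per_query_selections(per_query_scores: List[List[float]],
                                  local_window: int, budget: int) -> List[int]:
    """Blocks touched if every query selects independently; the union can
    exceed `budget`, which is the growth the abstract calls Union Explosion."""
    touched = set()
    for q in per_query_scores:
        touched.update(select_blocks(q, local_window, budget))
    return sorted(touched)


def select_per_head(per_head_query_scores: List[List[List[float]]],
                    local_window: int, budget: int) -> List[List[int]]:
    """One shared, possibly non-contiguous index list per attention head.
    The lists are kept separate rather than merged into one large buffer."""
    return [select_blocks(shared_block_scores(h), local_window, budget)
            for h in per_head_query_scores]


if __name__ == "__main__":
    # Hypothetical scores: 2 heads, 2 queries per chunk, 6 historical blocks.
    head0 = [[1, 9, 2, 5, 8, 7], [3, 7, 2, 1, 9, 9]]
    head1 = [[9, 1, 2, 8, 1, 1], [7, 1, 2, 6, 1, 1]]
    print(union_of_per_query_selections(head0, local_window=2, budget=2))
    print(select_per_head([head0, head1], local_window=2, budget=2))
```

In this toy setting, independent per-query selection for the first head touches three blocks under a budget of two, whereas the shared selection stays within the budget and each head keeps its own index list.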