Pack and Force Your Memory: Long-form and Consistent Video Generation
This work addresses the problem of generating consistent long videos for video synthesis applications, though its improvements over existing autoregressive models appear incremental.
The paper tackles long-form video generation by modeling long-range dependencies and limiting error accumulation in autoregressive decoding, reporting minute-level temporal consistency and improved reliability.
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximation strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
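The linear-complexity claim can be illustrated with a toy sketch. This is not the paper's MemoryPack implementation: the abstract does not describe the architecture, so everything below is an assumption made for illustration. That includes the fixed number of memory slots, the projection-free attention, the 0.5/0.5 blending rule, and the names `attend`, `update_memory`, and `rollout`, all of which are hypothetical. The sketch only shows why a fixed-size memory that is read and updated once per segment keeps the number of attention scores proportional to video length, whereas attending over the full history would grow quadratically.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys double as values (no learned projections)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def update_memory(memory, segment, guidance):
    """Hypothetical update rule: each memory slot retrieves from the new segment
    plus the global guidance tokens, then blends the result with its old value."""
    retrieved = attend(memory, segment + guidance)
    return [[0.5 * m + 0.5 * r for m, r in zip(slot, new)]
            for slot, new in zip(memory, retrieved)]


def rollout(num_segments, tokens_per_segment=4, memory_slots=2, dim=3, seed=0):
    """Process segments one by one; return the final memory and the total
    number of attention scores computed along the way."""
    rng = random.Random(seed)
    # One token standing in for text/image guidance (random values, assumption).
    guidance = [[rng.gauss(0, 1) for _ in range(dim)]]
    memory = [[0.0] * dim for _ in range(memory_slots)]
    score_count = 0
    for _ in range(num_segments):
        # Random features stand in for the tokens of the current video segment.
        segment = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(tokens_per_segment)]
        # Short-term context (own tokens) plus long-term context (memory).
        context = attend(segment, segment + memory)
        score_count += tokens_per_segment * (tokens_per_segment + memory_slots)
        # Fold the processed segment into the fixed-size memory.
        memory = update_memory(memory, context, guidance)
        score_count += memory_slots * (tokens_per_segment + len(guidance))
    return memory, score_count


if __name__ == "__main__":
    for n in (10, 20, 40):
        _, cost = rollout(n)
        print(n, cost)  # cost doubles whenever the number of segments doubles
```

Because the memory size is fixed, the work per segment is constant, so doubling the number of segments doubles the total cost. How the actual method retrieves and compresses context, and how it uses the textual and image guidance, is specified in the paper rather than in this sketch.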