DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
For researchers in long video generation, DySink replaces static early-frame sinks with retrieved historical frames, so that the long-range context kept in bounded memory stays relevant to the current visual state.
DySink addresses the problem of static early-frame sinks in autoregressive long video generation, which can leave the model conditioned on outdated context and, in severe cases, cause sink collapse. By retrieving visually relevant historical frames as dynamic sinks and filtering them with a sink anomaly gate, it improves dynamic degree and temporal quality over baselines on minute-long videos.
Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.
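To make the two components concrete, the following is a minimal sketch of retrieval plus gating, not the released implementation. It assumes each cached frame is summarized by a single feature vector, that relevance is measured by cosine similarity, and that inter-head consensus is the mean pairwise cosine similarity between per-head attention distributions over the retrieved frames. The function names, the feature representation, the consensus measure, and the 0.95 threshold are all hypothetical choices made for illustration.

```python
import numpy as np

EPS = 1e-8  # avoids division by zero when normalizing


def retrieve_dynamic_sinks(memory, query, k):
    """Return indices of the k memory frames most similar to the query.

    memory: (n_frames, dim) array of per-frame features (assumed representation).
    query:  (dim,) feature of the current frame.
    Indices are returned in ascending (temporal) order.
    """
    mem = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + EPS)
    q = query / (np.linalg.norm(query) + EPS)
    scores = mem @ q                      # cosine similarity per cached frame
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]         # highest similarity first
    return np.sort(top)


def inter_head_consensus(attn):
    """Mean pairwise cosine similarity between attention heads.

    attn: (n_heads, n_sinks) attention mass each head assigns to each
    retrieved frame. Values near 1 mean the heads attend almost identically.
    """
    n_heads = attn.shape[0]
    if n_heads < 2:
        return 0.0                        # consensus is undefined for one head
    a = attn / (np.linalg.norm(attn, axis=1, keepdims=True) + EPS)
    sim = a @ a.T
    off_diag = sim[~np.eye(n_heads, dtype=bool)]
    return float(off_diag.mean())


def sink_anomaly_gate(attn, threshold=0.95):
    """Return True to keep the retrieved context, False to suppress it.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return inter_head_consensus(attn) < threshold


if __name__ == "__main__":
    bank = np.eye(4)                              # four toy frame features
    current = np.array([0.0, 0.9, 0.1, 0.0])      # closest to frames 1 and 2
    sinks = retrieve_dynamic_sinks(bank, current, k=2)
    diverse = np.array([[1.0, 0.0], [0.0, 1.0]])  # heads disagree
    uniform = np.array([[0.5, 0.5], [0.5, 0.5]])  # heads fully agree
    print(sinks, sink_anomaly_gate(diverse), sink_anomaly_gate(uniform))
```

In this sketch the gate makes an all-or-nothing decision over the whole retrieved set, which keeps the example short; a practical gate could equally well down-weight individual frames or heads. Retrieved indices are returned in temporal order so that downstream positional handling sees a monotone sequence.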