CV | Dec 4, 2025

EgoLCD: Egocentric Video Generation with Long Context Diffusion

Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang

arXiv:2512.04515v110.24 citations | h-index: 7 | Has Code

Originality: Highly original

AI Analysis

EgoLCD targets the challenge of reliable long-term memory for hand-object interactions and procedural tasks in egocentric video generation, which the authors present as a significant step toward scalable world models for embodied AI.

The paper tackles the problem of generating long, coherent egocentric videos by addressing content drift in existing autoregressive models. The authors report that EgoLCD achieves state-of-the-art performance on the EgoVid-5M benchmark in both perceptual quality and temporal consistency.

Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
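To make the memory-management framing concrete, the following is a minimal sketch of a two-tier key-value memory: a dense window of recent entries plus a budgeted long-term store. It is not the authors' implementation. The abstract does not say how the Long-Term Sparse KV Cache selects entries, so the eviction rule here (drop the long-term entry with the lowest accumulated attention weight) is an assumption borrowed from generic top-k cache pruning. The class `TwoTierMemory` and its parameters `window` and `long_term_budget` are hypothetical names. LoRA adaptation, the Memory Regulation Loss, and Structured Narrative Prompting are not modeled.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TwoTierMemory:
    """Hypothetical two-tier KV memory (illustrative, not EgoLCD's code).

    short: dense window holding the most recent entries.
    long:  sparse store holding at most `long_term_budget` older entries.
    Each entry is a mutable list [key, value, accumulated_attention].
    """

    def __init__(self, window, long_term_budget):
        self.window = window
        self.budget = long_term_budget
        self.short = deque()
        self.long = []

    def append(self, key, value):
        self.short.append([key, value, 0.0])
        if len(self.short) > self.window:
            # The oldest short-term entry graduates to the long-term store.
            self.long.append(self.short.popleft())
            if len(self.long) > self.budget:
                # Assumed eviction rule: drop the entry that has received
                # the least attention so far.
                weakest = min(range(len(self.long)),
                              key=lambda i: self.long[i][2])
                self.long.pop(weakest)

    def attend(self, query):
        """Scaled dot-product attention over both tiers."""
        entries = self.long + list(self.short)
        if not entries:
            raise ValueError("memory is empty")
        scale = 1.0 / math.sqrt(len(query))
        weights = softmax([dot(query, e[0]) * scale for e in entries])
        for entry, w in zip(entries, weights):
            entry[2] += w  # track how useful each entry has been
        dim = len(entries[0][1])
        return [sum(w * e[1][d] for e, w in zip(entries, weights))
                for d in range(dim)]
```

In this sketch, an entry that was attended to heavily while it sat in the recent window survives in the long-term store after newer, rarely used entries are evicted. This is one simple way to keep an object's identity available across many generated segments at a fixed memory cost.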
