CV · Dec 4, 2025

EgoLCD: Egocentric Video Generation with Long Context Diffusion

arXiv:2512.04515v1 · 4 citations · h-index: 7 · Has Code
Originality Highly original
AI Analysis

EgoLCD addresses the need for reliable long-term memory when generating egocentric video of hand-object interactions and procedural tasks, which the authors present as a significant step toward scalable world models for embodied AI.

The paper tackles long, coherent egocentric video generation by addressing content drift in existing autoregressive models. The authors report state-of-the-art perceptual quality and temporal consistency on the EgoVid-5M benchmark.

Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
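
To make the memory-management framing in the abstract concrete, here is a minimal sketch of a two-tier key/value memory: a dense window of recent entries plus a small long-term store that keeps only the entries that have received the most attention. It is an illustration under stated assumptions, not the authors' implementation. The class name `TwoTierMemory`, the eviction rule (drop the entry with the lowest accumulated attention weight), and all parameters are hypothetical. The LoRA adaptation, Memory Regulation Loss, and Structured Narrative Prompting are not modeled. The linked repository is the reference for the actual method.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TwoTierMemory:
    """Hypothetical KV memory: dense recent window + sparse long-term store."""

    def __init__(self, window, long_term_budget):
        self.window = window                      # max entries in short-term memory
        self.long_term_budget = long_term_budget  # max entries in long-term store
        self.short = deque()                      # each entry: [key, value, score]
        self.long = []

    def append(self, key, value):
        """Add a new entry; overflow from the window moves to long-term memory."""
        self.short.append([key, value, 0.0])
        if len(self.short) > self.window:
            self.long.append(self.short.popleft())
            if len(self.long) > self.long_term_budget:
                # Assumed sparsification rule: evict the least-attended entry.
                weakest = min(range(len(self.long)),
                              key=lambda i: self.long[i][2])
                self.long.pop(weakest)

    def attend(self, query):
        """Scaled dot-product attention over long-term plus short-term entries."""
        entries = self.long + list(self.short)
        if not entries:
            raise ValueError("memory is empty")
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, e[0]) / scale for e in entries])
        for e, w in zip(entries, weights):
            e[2] += w  # accumulate attention as an importance score
        dim = len(entries[0][1])
        return [sum(w * e[1][d] for e, w in zip(entries, weights))
                for d in range(dim)]


memory = TwoTierMemory(window=1, long_term_budget=1)
memory.append([1.0, 0.0], [1.0, 0.0])   # will be heavily attended
memory.append([0.0, 1.0], [0.0, 1.0])   # pushes the first entry to long-term
output = memory.attend([5.0, 0.0])      # query aligned with the first key
memory.append([0.5, 0.5], [0.5, 0.5])   # long-term store overflows, one entry is evicted
```

In this toy run the first entry survives in the long-term store because it collected most of the attention mass, while the second entry is evicted. Total memory stays bounded at `window + long_term_budget` entries regardless of sequence length, which is the property a sparse cache is meant to provide.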

