CVDec 4, 2025Code
EgoLCD: Egocentric Video Generation with Long Context DiffusionLiuzhou Zhang, Jiarui Ye, Yuanlei Wang et al.
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
CVJul 26, 2024
IOVS4NeRF:Incremental Optimal View Selection for Large-Scale NeRFsJingpeng Xie, Shiyu Tan, Yuanlei Wang et al.
Large-scale Neural Radiance Fields (NeRF) reconstructions are typically hindered by the requirement for extensive image datasets and substantial computational resources. This paper introduces IOVS4NeRF, a framework that employs an uncertainty-guided incremental optimal view selection strategy adaptable to various NeRF implementations. Specifically, by leveraging a hybrid uncertainty model that combines rendering and positional uncertainties, the proposed method calculates the most informative view from among the candidates, thereby enabling incremental optimization of scene reconstruction. Our detailed experiments demonstrate that IOVS4NeRF achieves high-fidelity NeRF reconstruction with minimal computational resources, making it suitable for large-scale scene applications.
CVNov 22, 2025
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic BridgingMing Zhong, Yuanlei Wang, Liuzhou Zhang et al.
While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans who naturally bridge details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Interestingly, it not only improves on HVCU-Bench but also brings benefits on general benchmarks (average +2.53%), especially with substantial gains on MMStar (+7.26%), demonstrating the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities. The project page is at https://vcu-bridge.github.io .