Aiden Yiliu Li

AI
h-index18
4papers
4citations
Novelty50%
AI Score54

4 Papers

66.9CVMay 15Code
Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

Aiden Yiliu Li, Nels Numan, Anthony Steed

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.

AIFeb 2Code
Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

Aiden Yiliu Li, Xinyue Hao, Shilong Liu et al.

Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.

23.5AIApr 14Code
Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad et al.

Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present Avenir-UX, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, Avenir-UX grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of Avenir-UX and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow-AI/Avenir-UX

AIDec 1, 2025
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

Aiden Yiliu Li, Bizhi Yu, Daoan Lei et al.

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground CoG a training free multi step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction the model progressively reflects and adjusts its hypotheses leading to more accurate and interpretable localization. Our approach achieves 68.4 accuracy on the ScreenSpot Pro benchmark an improvement of 4.8 points. To measure real world generalization we introduce TPanel UI a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI Chain of Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points showing the effectiveness of multi step training free grounding across real world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.