Lichi Li

AI
h-index10
6papers
15citations
Novelty48%
AI Score52

6 Papers

97.5AIJun 3Code
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Zhangchen Xu, Junda Chen, Yue Huang et al.

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

IRFeb 22, 2024Code
MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems

Lichi Li, Zainul Abi Din, Zhen Tan et al.

In the evolving e-commerce field, recommendation systems crucially shape user experience and engagement. The rise of Consumer-to-Consumer (C2C) recommendation systems, noted for their flexibility and ease of access for customer vendors, marks a significant trend. However, the academic focus remains largely on Business-to-Consumer (B2C) models, leaving a gap filled by the limited C2C recommendation datasets that lack in item attributes, user diversity, and scale. The intricacy of C2C recommendation systems is further accentuated by the dual roles users assume as both sellers and buyers, introducing a spectrum of less uniform and varied inputs. Addressing this, we introduce MerRec, the first large-scale dataset specifically for C2C recommendations, sourced from the Mercari e-commerce platform, covering millions of users and products over 6 months in 2023. MerRec not only includes standard features such as user_id, item_id, and session_id, but also unique elements like timestamped action types, product taxonomy, and textual product attributes, offering a comprehensive dataset for research. This dataset, extensively evaluated across four recommendation tasks, establishes a new benchmark for the development of advanced recommendation algorithms in real-world scenarios, bridging the gap between academia and industry and propelling the study of C2C recommendations. Our experiment code is available at https://github.com/mercari/mercari-ml-merrec-pub-us and dataset at https://huggingface.co/datasets/mercari-us/merrec.

66.3SEMar 19
kRAIG: A Natural Language-Driven Agent for Automated DataOps Pipeline Generation

Rohan Siva, Kai Cheung, Lichi Li et al.

Modern machine learning systems rely on complex data engineering workflows to extract, transform, and load (ELT) data into production pipelines. However, constructing these pipelines remains time-consuming and requires substantial expertise in data infrastructure and orchestration frameworks. Recent advances in large language model (LLM) agents offer a potential path toward automating these workflows, but existing approaches struggle with under-specified user intent, unreliable tool generation, and limited guarantees of executable outputs. We introduce kRAIG, an AI agent that translates natural language specifications into production-ready Kubeflow Pipelines (KFP). To resolve ambiguity in user intent, we propose ReQuesAct (Reason, Question, Act), an interaction framework that explicitly clarifies intent prior to pipeline synthesis. The system orchestrates end-to-end data movement from diverse sources and generates task-specific transformation components through a retrieval-augmented tool synthesis process. To ensure data quality and safety, kRAIG incorporates LLM-based validation stages that verify pipeline integrity prior to execution. Our framework achieves a 3x improvement in extraction and loading success and a 25 percent increase in transformation accuracy compared to state-of-the-art agentic baselines. These improvements demonstrate that structured agent workflows with explicit intent clarification and validation significantly enhance the reliability and executability of automated data engineering pipelines.

CVMay 27, 2025
IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

Yifan Li, Yuhang Chen, Anh Dao et al.

Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. Besides, it also provides extra reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouse and 373 pairs from large ones, incorporating scenarios with and without human. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. Benchmark and codes are available.

RONov 21, 2025
IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

Yifan Li, Lichi Li, Anh Dao et al.

While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environment.

AIJun 3, 2025
Towards Generating Controllable and Solvable Geometry Problem by Leveraging Symbolic Deduction Engine

Zhuoxuan Jiang, Tianyang Zhang, Peiyan Peng et al.

Generating high-quality geometry problems is both an important and challenging task in education. Compared to math word problems, geometry problems further emphasize multi-modal formats and the translation between informal and formal languages. In this paper, we introduce a novel task for geometry problem generation and propose a new pipeline method: the Symbolic Deduction Engine-based Geometry Problem Generation framework (SDE-GPG). The framework leverages a symbolic deduction engine and contains four main steps: (1) searching a predefined mapping table from knowledge points to extended definitions, (2) sampling extended definitions and performing symbolic deduction, (3) filtering out unqualified problems, and (4) generating textual problems and diagrams. Specifically, our method supports to avoid inherent biases in translating natural language into formal language by designing the mapping table, and guarantees to control the generated problems in terms of knowledge points and difficulties by an elaborate checking function. With obtained formal problems, they are translated to natural language and the accompanying diagrams are automatically drew by rule-based methods. We conduct experiments using real-world combinations of knowledge points from two public datasets. The results demonstrate that the SDE-GPG can effectively generate readable, solvable and controllable geometry problems.