88.1AIJun 1
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent TrajectoriesJiaming Wang, Ziteng Feng, Jiangtao Wu et al.
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
ARFeb 17, 2023
HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level SynthesisZhigang Wei, Aman Arora, Ruihao Li et al.
Machine Learning (ML) has been widely adopted in design exploration using high level synthesis (HLS) to give a better and faster performance, and resource and power estimation at very early stages for FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models.This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated from widely used HLS C benchmarks including Polybench, Machsuite, CHStone and Rossetta. The Verilog samples are generated with a variety of directives including loop unroll, loop pipeline and array partition to make sure optimized and realistic designs are covered. The total number of generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the effectiveness of our dataset, we undertake case studies to perform power estimation and resource usage estimation with ML models trained with our dataset. All the codes and dataset are public at the github repo.We believe that HLSDataset can save valuable time for researchers by avoiding the tedious process of running tools, scripting and parsing files to generate the dataset, and enable them to spend more time where it counts, that is, in training ML models.
45.3ARMay 5
SPEC CPU2026: Characterization, Representativeness, and Cross-Suite ComparisonRuihao Li, Andrew Jacob, Neeraja J. Yadwadkar et al.
Specialized accelerators dominate AI workloads, but CPUs remain critical for orchestrating these accelerators and running datacenter services. As a result, CPU performance increasingly shapes end-to-end system efficiency, making it necessary for benchmarks to reflect modern workloads and bottlenecks. However, it remains unclear how emerging CPU benchmark suites reflect these shifts. To address this, we present the first comprehensive characterization of SPEC CPU2026 across nine platforms spanning recent Intel, AMD, Ampere, and Nvidia processors. We find that, compared to SPEC CPU2017, SPEC CPU2026 increases instruction volume and memory footprint, and shifts pressure toward emerging bottlenecks, most notably higher instruction-cache stress. We next examine whether the full suite is necessary for architectural evaluation. Using clustering-based representativeness analysis, we identify that compact subsets of 4-5 workloads per group preserve 96.4-99.9% of full-suite behavior, substantially reducing evaluation costs without sacrificing fidelity. To better position SPEC CPU2026, we compare it against SPEC CPU2017, DCPerf, and MLPerf using cross-suite microarchitectural metrics. SPEC CPU2026 remains a general-purpose suite with complementary characteristics: it is less vector-intensive than MLPerf and has lower frontend pressure than DCPerf, yet moves closer to real-world CPU behavior than prior SPEC CPU generations. Finally, we show that SPEC CPU2026 supports practical architectural studies beyond aggregate scores through case studies on page sizes and allocators, prefetching, compiler optimizations, ISA sensitivity, and many-core scaling. The new round-robin stagger mode generates proxy workloads that approximate DCPerf, reducing the IPC gap to 13.7%. Overall, SPEC CPU2026 sets a new foundation for rigorous and cost-effective CPU evaluation.
AIDec 24, 2023
LARP: Language-Agent Role Play for Open-World GamesMing Yan, Ruihao Li, Hao Zhang et al.
Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we introduce Language Agent for Role-Playing (LARP), which includes a cognitive architecture that encompasses memory processing and a decision-making assistant, an environment interaction module with a feedback-driven learnable action space, and a postprocessing method that promotes the alignment of various personalities. The LARP framework refines interactions between users and agents, predefined with unique backgrounds and personalities, ultimately enhancing the gaming experience in open-world contexts. Furthermore, it highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios. The project page is released at https://miao-ai-lab.github.io/LARP/.
79.1HCMar 18
SPRITE: From Static Mockups to Engine-Ready Game UIYunshu Bai, RuiHao Li, Hao Zhang et al.
Game UI implementation requires translating stylized mockups into interactive engine entities. However, current "Screenshot-to-Code" tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision-Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container relationships and non-rectangular layouts. We evaluated SPRITE against a curated Game UI benchmark and conducted expert reviews with professional developers to assess reconstruction fidelity and prototyping efficiency. Our findings demonstrate that SPRITE streamlines development by automating tedious coding and resolving complex nesting. By facilitating rapid in-engine iteration, SPRITE effectively blurs the boundaries between artistic design and technical implementation in game development. Project page: https://baiyunshu.github.io/sprite.github.io/
AIDec 11, 2025
Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content GenerationLim Chien Her, Ming Yan, Yunshu Bai et al.
Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
CVJun 26, 2019
FA-Harris: A Fast and Asynchronous Corner Detector for Event CamerasRuoxiang Li, Dianxi Shi, Yongjun Zhang et al.
Recently, the emerging bio-inspired event cameras have demonstrated potentials for a wide range of robotic applications in dynamic environments. In this paper, we propose a novel fast and asynchronous event-based corner detection method which is called FA-Harris. FA-Harris consists of several components, including an event filter, a Global Surface of Active Events (G-SAE) maintaining unit, a corner candidate selecting unit, and a corner candidate refining unit. The proposed G-SAE maintenance algorithm and corner candidate selection algorithm greatly enhance the real-time performance for corner detection, while the corner candidate refinement algorithm maintains the accuracy of performance by using an improved event-based Harris detector. Additionally, FA-Harris does not require artificially synthesized event-frames and can operate on asynchronous events directly. We implement the proposed method in C++ and evaluate it on public Event Camera Datasets. The results show that our method achieves approximately 8x speed-up when compared with previously reported event-based Harris detector, and with no compromise on the accuracy of performance.
CVSep 20, 2017
UnDeepVO: Monocular Visual Odometry through Unsupervised Deep LearningRuihao Li, Sen Wang, Zhiqiang Long et al.
We propose a novel monocular visual odometry (VO) system called UnDeepVO in this paper. UnDeepVO is able to estimate the 6-DoF pose of a monocular camera and the depth of its view by using deep neural networks. There are two salient features of the proposed UnDeepVO: one is the unsupervised deep learning scheme, and the other is the absolute scale recovery. Specifically, we train UnDeepVO by using stereo image pairs to recover the scale but test it by using consecutive monocular images. Thus, UnDeepVO is a monocular system. The loss function defined for training the networks is based on spatial and temporal dense information. A system overview is shown in Fig. 1. The experiments on KITTI dataset show our UnDeepVO achieves good performance in terms of pose accuracy.