Haiying Sun

AI
h-index29
3papers
2citations
Novelty53%
AI Score41

3 Papers

94.8CVJun 2
VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Hang He, Chuhuai Yue, Chengqi Dong et al.

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

AIDec 8, 2025
LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Hang He, Chuhuai Yue, Chengqi Dong et al.

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explores vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompass diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high-quality entries from various cities and business types. We construct 300 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also developed LocalPlayground, a unified environment integrating multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at localsearchbench.github.io.

SEApr 1, 2025
Automated detection of atomicity violations in large-scale systems

Hang He, Yixing Luo, Chengcheng Wan et al.

Atomicity violations in interrupt-driven programs pose a significant threat to software reliability in safety-critical systems. These violations occur when the execution sequence of operations on shared resources is disrupted by asynchronous interrupts. Detecting atomicity violations is challenging due to the vast program state space, application-level code dependencies, and complex domain-specific knowledge. In this paper, we propose CLOVER, a multi-agent framework for detecting atomicity violations in real-world interrupt-driven programs. Its plan agent orchestrates four static analysis tools to extract key information and generate code summaries. CLOVER then initializes several Expert-Judge agent pairs to detect and validate different patterns of atomicity violation, through an iterative manner. Evaluations on RaceBench, SV-COMP, and RWIP demonstrate that CLOVER achieves a precision/recall of 91.0%/96.4%, outperforming existing approaches by 33.0-117.2% on F1-score. Additionally, it identifies 12 atomicity violations in 11 real-world aerospace software projects, one of which is previously unknown.