Huacan Chai

AI
h-index56
9papers
236citations
Novelty44%
AI Score56

9 Papers

CLMay 15Code
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

Huacan Chai, Yukai Wang, Yingxuan Yang et al.

Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce Source-distributed Multimodal Memory Benchmark(SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.

CLSep 5, 2023
CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

Lingyue Fu, Huacan Chai, Shuang Luo et al.

With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. Evaluating the programming capabilities of LLMs is crucial as it reflects the multifaceted abilities of LLMs, and it has numerous downstream applications. In this paper, we propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. Programming comprehension task tests LLMs on multiple-choice exam questions covering conceptual understanding, commonsense reasoning, and multi-hop reasoning. The code generation task evaluates LLMs through completing C++ functions based on provided descriptions and prototypes. The code correction task asks LLMs to fix real-world erroneous code segments with different error messages. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively. Compared to human performance, there is still significant room for improvement in LLM programming. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth.

AIJul 28, 2025Code
Agentic Web: Weaving the Next Web with AI Agents

Yingxuan Yang, Mulei Ma, Yuxuan Huang et al.

The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human-driven to machine-to-machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: https://github.com/SafeRL-Lab/agentic-web.

SEMay 3, 2024Code
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Kounianhua Du, Jizheng Chen, Renting Rui et al.

Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing between natural language and programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework that bridges the gap between NL and PL to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge, which can facilitate natural language based LLMs for better understanding of code syntax and serve as a bridge among different programming languages. To take the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template to transform the challenging syntax graph into informative graphical view for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into model parameters via finetuning the models with the soft signals encoded by GNN expert model. Specifically, two constraints are designed to improve the alignment and structure expressiveness, contributing to the informativeness of the single-token-sized external <GraphEmb> for enhanced code generation. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation. Implementation is available at https://anonymous.4open.science/r/Code-5970/ .

SEApr 9
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Chenyu Zhou, Huacan Chai, Wenteng Chen et al.

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can solve more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade-off between parametric and externalized capability, identify emerging directions such as self-evolving harnesses and shared agent infrastructure, and discuss open challenges in evaluation, governance, and the long-term co-evolution of models and external infrastructure. The result is a systems-level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

AIApr 23, 2025
A Survey of AI Agent Protocols

Yingxuan Yang, Huacan Chai, Yuanyi Song et al.

The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and triggering the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two-dimensional classification that differentiates context-oriented versus inter-agent protocols and general-purpose versus domain-specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next-generation protocols. These characteristics include adaptability, privacy preservation, and group-based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.

MAApr 1, 2025
AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Yingxuan Yang, Huacan Chai, Shuai Shao et al.

The rapid advancement of large language models (LLMs) has enabled the development of multi-agent systems where multiple LLM-based agents collaborate on complex tasks. However, existing systems often rely on centralized coordination, leading to scalability bottlenecks, reduced adaptability, and single points of failure. Privacy and proprietary knowledge concerns further hinder cross-organizational collaboration, resulting in siloed expertise. We propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to specialize, evolve, and collaborate autonomously in a dynamically structured Directed Acyclic Graph (DAG). Unlike prior approaches with static roles or centralized control, AgentNet allows agents to adjust connectivity and route tasks based on local expertise and context. AgentNet introduces three key innovations: (1) a fully decentralized coordination mechanism that eliminates the need for a central orchestrator, enhancing robustness and emergent intelligence; (2) dynamic agent graph topology that adapts in real time to task demands, ensuring scalability and resilience; and (3) a retrieval-based memory system for agents that supports continual skill refinement and specialization. By minimizing centralized control and data exchange, AgentNet enables fault-tolerant, privacy-preserving collaboration across organizations. Experiments show that AgentNet achieves higher task accuracy than both single-agent and centralized multi-agent baselines.

AIOct 16, 2025
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

Jianghao Lin, Yuanyuan Shi, Xin Peng et al.

Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application in structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM beats the coarse-grained and outcome reward models in terms of predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling technique equipped with ToolPRM also significantly improves the backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: "explore more but retain less" due to the unrecoverability characteristics of structured function calling generation.

CLSep 27, 2025
PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Huacan Chai, Zijie Cao, Maolin Ran et al.

Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.