SE · AI · May 7

A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

arXiv:2605.06737 · 69.0

Predicted impact: top 34% in SE · last 90 days · Originality: Incremental advance

AI Analysis

For developers deploying LLM-based agents in production, this work addresses the critical reliability challenge by providing an integrated monitoring and recovery system.

This paper proposes a self-healing framework for LLM-based autonomous agents that integrates failure detection, reliability assessment, and automated recovery. Experiments show significant improvements in task success rates and system robustness compared to existing methods.

Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability-aware self-healing framework for LLM-based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of failure types and introduce a quantitative reliability assessment model. Next, we propose a failure detection method that identifies abnormal agent behavior based on execution patterns and output consistency. Finally, we design a self-healing mechanism that dynamically recovers from failures through adaptive replanning and corrective prompting strategies. The proposed framework was implemented in a multi-agent workflow environment and evaluated using real-world task scenarios. Experimental results demonstrate that our approach significantly increases task success rates, reduces failure propagation, and enhances overall system robustness compared to existing methods. In particular, this study distinguishes itself by establishing an integrated monitoring system that combines the agent's internal reasoning process with external execution results. These findings are expected to contribute to securing the stability of advanced autonomous systems and lowering the barriers to LLM adoption in production environments.
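The abstract does not give implementation details, so the following is only a minimal sketch of how a detect, assess, and recover loop of this kind could be wired together. Every name in it (`Failure`, `StepResult`, `detect_failure`, `reliability`, `run_with_healing`) is hypothetical and not taken from the paper. The consistency check (sampling the agent twice and comparing outputs) and the reliability score (fraction of successful attempts) are simple stand-ins, not the authors' models. Only corrective prompting is illustrated; adaptive replanning is omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Failure(Enum):
    # Hypothetical two-entry taxonomy; the paper's actual taxonomy is not given here.
    EXECUTION_ERROR = "execution_error"
    INCONSISTENT_OUTPUT = "inconsistent_output"


@dataclass
class StepResult:
    output: Optional[str]
    error: Optional[str] = None


def detect_failure(samples: List[StepResult]) -> Optional[Failure]:
    """Flag a failure if any sample errored or the samples disagree."""
    if any(s.error is not None for s in samples):
        return Failure.EXECUTION_ERROR
    if len({s.output for s in samples}) > 1:
        return Failure.INCONSISTENT_OUTPUT
    return None


def reliability(history: List[bool]) -> float:
    """Illustrative score: fraction of attempts that passed detection."""
    return sum(history) / len(history) if history else 1.0


def run_with_healing(
    agent: Callable[[str], StepResult],
    prompt: str,
    max_retries: int = 2,
    n_samples: int = 2,
) -> Tuple[Optional[str], float]:
    """Run the agent, and on a detected failure retry with a corrective prompt."""
    history: List[bool] = []
    current = prompt
    for _ in range(max_retries + 1):
        samples = [agent(current) for _ in range(n_samples)]
        failure = detect_failure(samples)
        history.append(failure is None)
        if failure is None:
            return samples[0].output, reliability(history)
        # Corrective prompting: tell the agent what went wrong and ask it to re-check.
        current = (
            f"{prompt}\n[Previous attempt failed: {failure.value}. "
            "Re-check each step.]"
        )
    return None, reliability(history)
```

In this sketch `agent` is any callable that maps a prompt string to a `StepResult`, so a real LLM call or a stub can be plugged in. The returned score could be used by a surrounding workflow to decide whether to pass the result on to downstream agents, which is one way to limit failure propagation.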
