AI · CL · MA · May 19, 2025

Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems

arXiv:2505.13546v1 · 9 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the need for more robust and trustworthy prompt generation systems in AI. It is an incremental advance: it builds on existing methods and adds a focus on stability.

The paper tackles the problem of unreliable auto-generated prompts in general-purpose multi-agent systems by introducing prompt stability as a key factor, and it shows that its stability-aware framework improves both accuracy and output consistency across tasks.

Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability (the consistency of model responses across repeated executions) as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability-aware general-purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability as a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
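To make the idea of response consistency across repeated executions concrete, here is a minimal sketch that scores a prompt by the mean pairwise similarity of its repeated outputs. It is not the paper's method: the paper measures semantic stability with a fine-tuned LLaMA-based evaluator, whereas this sketch substitutes a crude token-overlap (Jaccard) similarity. The names `jaccard` and `stability_score` and the sample responses are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1].

    Assumption: a crude stand-in for a learned semantic evaluator.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def stability_score(responses, similarity=jaccard):
    """Mean pairwise similarity over repeated responses to a single prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to measure stability")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical outputs from running the same prompt three times.
stable_runs = ["the answer is 42", "the answer is 42", "the answer is 42"]
unstable_runs = ["the answer is 42", "it might be 7", "cannot determine"]

print(stability_score(stable_runs))    # identical outputs, so the score is 1.0
print(stability_score(unstable_runs))  # no shared tokens, so the score is 0.0
```

The `similarity` argument is pluggable, so a stronger semantic judge (an embedding model or an LLM-based evaluator) could replace `jaccard` without changing the aggregation. A stability-aware generator could then use a low score as a signal to revise the prompt before accepting it.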

