AIOct 11, 2025

Failure-Driven Workflow Refinement

Jusheng Zhang, Kaitong Cai, Qinglin Zeng, Ningyuan Liu, Stephen Fan, Ziliang Chen, Keze Wang

arXiv:2510.10035v113 citationsh-index: 12

Originality Highly original

AI Analysis

This work addresses the reliability of LLM-based workflows for users who depend on robust AI systems, representing a new paradigm rather than an incremental improvement.

The paper tackles the problem of optimizing LLM-based workflows by addressing information collapse in existing methods, proposing a new paradigm that minimizes Expected Failure Mass in a Failure Signature Space, resulting in higher robustness at significantly lower cost on math, code, and QA benchmarks.

Optimizing LLM-based workflows is typically formulated as a global search, where candidate workflows are evaluated based on a scalar metric. This paradigm, however, suffers from a critical flaw: information collapse. By reducing rich, multi-step execution traces to simple success/failure signals, existing methods are rendered blind to the underlying structure of failures, fundamentally preventing them from modeling the workflow's failure distribution. We reconceptualize this challenge as a distributional problem. We propose a new paradigm where the optimization goal is not to maximize a scalar score, but to directly minimize a workflow's Expected Failure Mass, i.e., the integral of its failure probability density function defined over a high-dimensional Failure Signature Space (FSS). This distributional lens allows us to move from inefficient, zero-order optimization to a principled, gradient-like descent on the failure landscape itself. We introduce CE-Graph, a framework that operationalizes this paradigm through a novel, failure-driven refinement process. CE-Graph approximates the failure distribution from a pool of counterexamples, identifies its densest regions as recurring failure modes, and applies targeted, operator-constrained graph edits via a Propose-and-Verify mechanism to greedily reduce the failure mass. On math, code, and QA benchmarks, our CE-Graph achieves higher robustness at a significantly lower cost than strong baselines. This suggests that a system's reliability emerges not from avoiding failures, but from systematically learning and reshaping the geometric structure of its failure distributions.

View on arXiv PDF

Similar