AI · May 18

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

arXiv:2605.19140 · 70.4

Predicted impact: top 49% in AI · last 90 days · Originality: Highly original

AI Analysis

This work claims what is, to the authors' knowledge, the first finite-sample guarantee for neural Q-learning under decentralized partial observability, addressing a key bottleneck in multi-agent systems that span organizational boundaries.

The paper formalizes multi-agent LLM pipelines with interface constraints as an IC-SMDP and proposes IC-Q, an asynchronous decentralized Q-learning algorithm with a finite-sample bound. Experiments show IC-Q matches a centralized oracle across four tasks without any agent observing joint trajectories.

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge, this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term by term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
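The handoff mechanism described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's neural IC-$Q$: it applies the standard SMDP $Q$-learning update, in which the bootstrap term is discounted by `gamma ** duration`, to agents that each keep a private table over their own local view of the artifact. The class `LocalAgent`, its methods, the toy artifact, and the choice of the receiving agent's maximum $Q$-value as the single scalar exchanged at a handoff are all assumptions made for illustration; the abstract states only that coordination is one scalar per handoff.

```python
from collections import defaultdict
from typing import Callable, Hashable


class LocalAgent:
    """Hypothetical agent holding a private tabular Q-function over its local view."""

    def __init__(self, observe: Callable[[object], Hashable], n_actions: int,
                 alpha: float = 0.5, gamma: float = 0.9):
        self.observe = observe      # local function of the shared artifact
        self.n_actions = n_actions
        self.alpha = alpha          # step size
        self.gamma = gamma          # per-primitive-step discount
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def handoff_value(self, artifact) -> float:
        """Scalar sent back at a handoff (assumed here: max Q on the local view)."""
        return max(self.q[self.observe(artifact)])

    def update(self, artifact, action: int, reward: float,
               duration: int, next_value: float) -> float:
        """Standard SMDP Q-learning step; the discount is gamma ** duration."""
        row = self.q[self.observe(artifact)]
        target = reward + (self.gamma ** duration) * next_value
        row[action] += self.alpha * (target - row[action])
        return row[action]


# Two agents with different local views of the same artifact (a toy list of stages).
writer = LocalAgent(observe=lambda art: len(art), n_actions=2)
reviewer = LocalAgent(observe=lambda art: art[-1], n_actions=2)
reviewer.q["draft"][1] = 2.0        # pretend the reviewer has already learned something

artifact_before = ["spec"]
artifact_after = ["spec", "draft"]  # the writer's option ran for 2 steps, then handed off

# The reviewer never sees the writer's state or table; it returns one scalar.
scalar = reviewer.handoff_value(artifact_after)
new_q = writer.update(artifact_before, action=0, reward=1.0,
                      duration=2, next_value=scalar)
```

In this sketch the writer updates from its own observation, its accumulated reward, the option duration, and the one scalar it receives, so neither agent needs access to the joint trajectory.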
