34.9 LG May 21
MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination
Joss Armstrong
Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically mis-calibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi Agent Runtime Grading via Incremental Normalization), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 19 foundation models, 8 benchmarks, and over 50,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence produces pairwise resolution worse than random (45-56%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and surpassing the always-best-model oracle on three of four benchmarks. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
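The abstract does not give MARGIN's update rule, so the following is only a minimal sketch of the general idea it describes: a per-agent, per-confidence-band exponentially weighted moving average of observed correctness, blended with the raw confidence by a shrinkage weight. The class name `OnlineCalibrator`, the choice of the three hyperparameters (`alpha`, `n_bands`, `prior_strength`), and the exact blending formula are all assumptions made for illustration, not the paper's definitions.

```python
class OnlineCalibrator:
    """Illustrative online calibrator (hypothetical; not the paper's algorithm)."""

    def __init__(self, alpha=0.1, n_bands=10, prior_strength=5.0):
        self.alpha = alpha                    # EWMA rate, same for up and down moves
        self.n_bands = n_bands                # number of confidence bands on [0, 1]
        self.prior_strength = prior_strength  # pseudo-count pulling toward raw confidence
        self.ewma = {}                        # (agent, band) -> running accuracy estimate
        self.count = {}                       # (agent, band) -> observations seen

    def _key(self, agent, confidence):
        band = min(int(confidence * self.n_bands), self.n_bands - 1)
        return agent, band

    def calibrated(self, agent, confidence):
        """Blend the learned band accuracy with the raw confidence."""
        key = self._key(agent, confidence)
        n = self.count.get(key, 0)
        if n == 0:
            return confidence                 # no evidence yet: trust the raw value
        w = n / (n + self.prior_strength)     # shrinkage weight grows with evidence
        return w * self.ewma[key] + (1.0 - w) * confidence

    def update(self, agent, confidence, correct):
        """Fold one observed outcome from the task stream into the band estimate."""
        key = self._key(agent, confidence)
        outcome = 1.0 if correct else 0.0
        if key not in self.ewma:
            self.ewma[key] = outcome
        else:
            # Symmetric update: one rate regardless of the direction of the error.
            self.ewma[key] += self.alpha * (outcome - self.ewma[key])
        self.count[key] = self.count.get(key, 0) + 1


# An agent that reports 0.9 confidence but is always wrong gets discounted.
cal = OnlineCalibrator()
for _ in range(50):
    cal.update("agent_a", 0.9, correct=False)
print(cal.calibrated("agent_a", 0.9))
```

A coordinator could then rank agents by `calibrated(...)` instead of the raw self-reported value; because the estimate is learned from the stream, it needs no held-out data or model access.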
13.0 AI May 19
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
Joss Armstrong
Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), which decompose a coalition value function into per-component marginal contributions and require evaluation of the system on arbitrary component subsets. That requirement fails for third-party APIs, opaque endpoints, and agentic orchestrators that concentrate routing on a few tools, leaving most coalitions un-evaluable from the deployed orchestrator. We introduce BOHM, which extracts a hierarchical attribution tree directly from the routing weights such systems already maintain: leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes. The method has zero marginal cost, requires no access to component internals, and provides multi-resolution attribution at every level simultaneously, which flat methods cannot offer at any evaluation budget. BOHM and SHAP answer different questions and converge when the deployed router routes near-optimally. On 18 LLMs in a 3-level hierarchy over 880 LiveCodeBench problems, BOHM yields Kendall tau=0.928; SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed. On a 5-driver, 7-benchmark agentic study (35 cells, complete coverage), drivers concentrate routing on a single tool (top-share median 0.65), and cell-level tau(BOHM,SHAP) is predicted by whether the driver's top pick is the empirically best tool (mean +0.22 vs ~+0.01). On a US Census hierarchy (475 leaves, 4 levels), BOHM recovers ground-truth rankings at every level (tau up to 0.722). BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley's additivity. It is best understood as a complementary primitive: a multi-resolution decomposition computable wherever routing state exists, whose disagreement with Shapley is itself diagnostic.
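The path-product definition in the abstract can be sketched in a few lines. This is an illustration under stated assumptions: the nested-dict tree encoding, the function name `level_attributions`, and the example routing weights are hypothetical, and the sketch assumes every leaf sits at the same depth and each node's child weights sum to 1.

```python
def level_attributions(tree):
    """Return one dict per depth mapping node path -> attribution.

    `tree` maps a child name to (routing_weight, subtree), where subtree
    has the same shape and is empty for a leaf. A node's attribution is
    the product of the routing weights on the path from the root to it.
    """
    levels = []
    frontier = [((), 1.0, tree)]  # (path so far, mass reaching node, children)
    while True:
        level, next_frontier = {}, []
        for path, mass, children in frontier:
            for name, (weight, subtree) in children.items():
                child_path = path + (name,)
                level[child_path] = mass * weight
                next_frontier.append((child_path, mass * weight, subtree))
        if not level:
            break
        levels.append(level)
        frontier = next_frontier
    return levels


# Hypothetical 2-level router: a task-type router over model routers.
routing = {
    "code": (0.6, {"model_a": (0.5, {}), "model_b": (0.5, {})}),
    "math": (0.4, {"model_c": (1.0, {})}),
}
for depth, level in enumerate(level_attributions(routing), start=1):
    print(depth, level)
```

Every level is read off the same routing weights in one pass, which is the multi-resolution property the abstract refers to; no component is ever re-evaluated.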
10.5 GT Apr 30
Implicit Evaluation Under Minimal Information: Price Formation in Hierarchical Component Selection
Joss Armstrong
We study hierarchical component selection under severe information constraints. Component quality is not directly observable, each selector observes only the outcome of the chosen pathway, and no explicit evaluation channel crosses module boundaries. We analyse a proportional-redistribution mechanism in which each selector maintains a weight vector over its children and updates that vector from observed outcomes. The sign of a parent's weight change can be read locally as an implicit binary evaluation signal by the selected child, yielding a decentralised evaluation mechanism with no explicit reporting channel. We give a full formal treatment. Proportional redistribution preserves market integrity algebraically. The sign of the weight change propagates without loss through the active path. The single-selector dynamics admit a unique interior equilibrium; for $N{=}2$ the equilibrium is exact and closed-form, while for general $N$ an equi-ratio condition yields an explicit affine equilibrium. Hierarchical composition is informationally clean, with each node's active-round dynamics identical to a standalone instance observed on a thinned clock. All structural results, the equilibrium formula, and the composition theorem are fully proved. Illustrative cases on synthetic hierarchies with up to 32,768 leaves and on three natural-hierarchy datasets confirm the mechanism's operation under constructed and applied conditions.
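The abstract does not state the update rule, so the sketch below shows only one plausible form of a proportional-redistribution step, under explicit assumptions: the step size `rate`, the function name `redistribute`, and the specific gain and loss formulas are hypothetical. What it illustrates is the two properties the abstract names: the total weight is conserved, and the sign of the chosen child's weight change acts as a binary signal.

```python
def redistribute(weights, chosen, success, rate=0.1):
    """One illustrative proportional-redistribution step (assumed form).

    On success the chosen child gains weight taken from its siblings in
    proportion to their current weights; on failure it loses weight,
    which is handed to the siblings in the same proportions.
    Returns (new_weights, sign), where sign is +1, -1, or 0.
    """
    w = list(weights)
    rest = sum(w) - w[chosen]          # total weight held by the siblings
    if rest <= 0:
        return w, 0                    # nothing to take from or give to
    delta = rate * rest if success else -rate * w[chosen]
    for j in range(len(w)):
        if j != chosen:
            w[j] -= delta * w[j] / rest  # siblings absorb the change pro rata
    w[chosen] += delta
    sign = (delta > 0) - (delta < 0)
    return w, sign


weights = [0.5, 0.3, 0.2]
after_win, sign_win = redistribute(weights, chosen=0, success=True)
after_loss, sign_loss = redistribute(weights, chosen=0, success=False)
print(after_win, sign_win)
print(after_loss, sign_loss)
```

Because siblings are adjusted pro rata, their relative ratios are unchanged by each step, and the selected child only needs to observe whether its own weight went up or down.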
3.1 GT Apr 29
MISES: Minimal Information Sufficiency for Effective Service
Joss Armstrong
Category-based coordination mechanisms allocate resources by mapping a declared service category to a fixed resource profile, without observing individual demand types. We establish three results for this class of mechanisms. First, the relative welfare gap Delta satisfies a tight two-sided bound in terms of the aggregate within-category allocation variance epsilon: (alpha/2W*)epsilon <= Delta <= (beta/2W*)epsilon. Second, the expected misreporting gain is bounded by the same epsilon without assumptions on agent strategy; demand-derived categories minimise both welfare loss and misreporting incentive simultaneously. Third, aggregate outcome metrics strictly dominate per-agent metrics for service-level detection under a homogeneity condition, for all parameter values, with a finite-sample power gap of O(1/m). At any fixed K, the demand-derived category label is the sufficient statistic for coordination: collecting per-agent data beyond the category label adds noise to the detection problem without reducing the welfare gap. However, welfare and detection impose structurally opposed demands on K: welfare improves with finer categories, detection worsens. The designer faces a feasibility band [Kmin, Kmax] and must choose K within it as a value judgement. We claim that any protocol achieving welfare gap Delta <= epsilon* and missed-detection rate <= beta* requires at least Hlb(epsilon*, beta*) bits of category entropy. We illustrate the mechanism on a synthetic population of 50,000 demand vectors and five weeks of production performance-management data from four anonymised operator networks (28,249 cells).
7.5 IT Apr 29
A Sufficient-Statistic Reduction of the Information Bottleneck to a Low-Dimensional Problem
Joss Armstrong
We show that if the conditional distribution p(C | T) factors through a sufficient statistic ϕ(T), then the Information Bottleneck (IB) problem for (T, C) is exactly equivalent to the IB problem for (ϕ(T), C). The reduction is loss-free: it preserves the full IB curve, the Lagrangian optimum at every trade-off parameter β, and the optimal representations up to pullback through ϕ. As a result, the computational complexity of solving the IB problem is governed by the dimension of the sufficient statistic rather than the ambient dimension of the source. This identifies an exact structural condition under which the generic IB problem becomes tractable, and gives a formal bridge between the discrete and linear-Gaussian regimes. We then show that the classical Gaussian IB solution of Chechik, Globerson, Tishby and Weiss is an immediate corollary of this reduction, and we state a nonlinear-Gaussian generalisation. A small numerical example illustrates the practical consequence: when a low-dimensional sufficient statistic is available, the exact IB curve can be computed on the reduced problem at a cost determined by the statistic rather than by the ambient source dimension.
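The premise of the reduction can be checked numerically on a toy discrete source. The sketch below is not the paper's example and does not solve the IB problem; it only verifies the standard fact the reduction rests on, namely that when p(C | T) depends on T only through ϕ(T), the relevant information is unchanged: I(T; C) = I(ϕ(T); C). The distributions, the map ϕ(t) = t // 2, and the helper name `mutual_information` are assumptions chosen for illustration.

```python
from math import log2


def mutual_information(joint):
    """Mutual information in bits of a joint given as {(x, c): probability}."""
    px, pc = {}, {}
    for (x, c), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        pc[c] = pc.get(c, 0.0) + p
    return sum(
        p * log2(p / (px[x] * pc[c]))
        for (x, c), p in joint.items()
        if p > 0
    )


# Toy source: T takes 4 values, but C depends on T only through phi(T).
p_t = [0.1, 0.2, 0.3, 0.4]
phi = lambda t: t // 2                                 # hypothetical sufficient statistic
p_c_given_s = {0: [0.9, 0.1], 1: [0.2, 0.8]}           # p(C | phi(T))

joint_tc, joint_sc = {}, {}
for t, pt in enumerate(p_t):
    for c, pc_s in enumerate(p_c_given_s[phi(t)]):
        joint_tc[(t, c)] = pt * pc_s
        key = (phi(t), c)
        joint_sc[key] = joint_sc.get(key, 0.0) + pt * pc_s

print(mutual_information(joint_tc), mutual_information(joint_sc))
```

The two printed values agree up to floating-point error, even though the reduced problem has two source symbols instead of four; the abstract's claim is the stronger statement that the whole IB curve is preserved as well.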