SE · AI · May 21

A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Joshua Odmark, Gideon Rubin, Deon van der Vyver

arXiv:2605.2305814.1 · Has Code

Predicted impact: top 86% in SE (last 90 days) · Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners working on autonomous Kubernetes operations, this work provides a methodology to make empirical claims falsifiable, but the contribution is incremental as it adapts existing verification concepts to a new domain.

The paper addresses the lack of falsifiability in empirical claims about autonomous Kubernetes operations agents by introducing a closed-loop measurement framework called agent-breakage. During the case study, the framework caught three confounds that would each have produced a wrong claim. The hypothesis under test, that retrieval over past postmortems compounds an agent's capability, was partially falsified: 1 of 3 dense-corpus scenarios was significant at p<0.05, and the pooled effect of +3.9 percentage points was not significant at n=60.
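To give a feel for why a +3.9 percentage point pooled effect can fail to reach significance at n=60, the following is a minimal sketch of a two-proportion z-test and a sample-size estimate. It is not the paper's analysis. It assumes the outcome can be treated as a binary success rate, that n=60 means 60 runs per arm, and that the control success rate is 50%; all three are illustrative assumptions, and the function names are hypothetical.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def two_proportion_p_value(p_control: float, p_treated: float, n_per_arm: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test with equal arm sizes."""
    pooled = (p_control + p_treated) / 2  # equal arms, so the pooled rate is the mean
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (p_treated - p_control) / standard_error
    return erfc(abs(z) / sqrt(2))  # two-sided tail probability of a standard normal


def runs_per_arm_for_power(p_control: float, p_treated: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate runs per arm needed to detect the difference (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    delta = p_treated - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


BASELINE = 0.50   # hypothetical control success rate, not taken from the paper
EFFECT = 0.039    # pooled effect reported in the abstract, as a proportion
RUNS_PER_ARM = 60  # assumes n=60 refers to runs per arm

p_value = two_proportion_p_value(BASELINE, BASELINE + EFFECT, RUNS_PER_ARM)
needed = runs_per_arm_for_power(BASELINE, BASELINE + EFFECT)
print(f"p-value at n={RUNS_PER_ARM}: {p_value:.2f}")
print(f"runs per arm for 80% power: {needed}")
```

Under these assumptions the p-value sits far above 0.05 and the required sample is much larger than 60 runs per arm, which is consistent with the abstract's point that samples are often too small for the noise level of the scoring system.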

Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.
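The closed loop described in the abstract (inject a fault, observe the agent, score on four axes, accumulate outcome-labeled tuples) might be sketched as follows. This is an illustrative outline, not the released agent-breakage code: every type and function name is hypothetical, the four axes are left unnamed because the abstract does not name them, and the fault injector, agent, and scorer are passed in as plain callables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class ErrorSource(Enum):
    """Separates harness failures from agent mistakes (hypothetical labels)."""
    NONE = "none"
    FRAMEWORK = "framework"   # the harness itself misbehaved
    REASONING = "reasoning"   # the agent chose a wrong action


@dataclass(frozen=True)
class Outcome:
    axis_scores: Tuple[float, float, float, float]  # four scoring axes, unnamed here
    error_source: ErrorSource


@dataclass(frozen=True)
class Record:
    """One outcome-labeled (state, action, outcome) tuple."""
    state: str
    action: str
    outcome: Outcome


def run_trial(fault: str,
              inject: Callable[[str], str],
              agent: Callable[[str], str],
              score: Callable[[str, str], Outcome]) -> Record:
    state = inject(fault)           # put the cluster into a known-bad state
    action = agent(state)           # observe what the agent does about it
    outcome = score(fault, action)  # compare against ground truth for this fault
    return Record(state, action, outcome)


def usable(records: List[Record]) -> List[Record]:
    """Drop trials where the harness failed, so they do not count against the agent."""
    return [r for r in records if r.outcome.error_source is not ErrorSource.FRAMEWORK]


# Stub components standing in for a real cluster, agent, and scorer.
def stub_inject(fault: str) -> str:
    return f"cluster with {fault}"


def stub_agent(state: str) -> str:
    return "restart pod" if "crashloop" in state else "do nothing"


def stub_score(fault: str, action: str) -> Outcome:
    if fault == "harness-timeout":
        return Outcome((0.0, 0.0, 0.0, 0.0), ErrorSource.FRAMEWORK)
    if action == "restart pod":
        return Outcome((1.0, 1.0, 1.0, 1.0), ErrorSource.NONE)
    return Outcome((0.0, 0.0, 0.0, 0.0), ErrorSource.REASONING)


faults = ["crashloop", "dns-outage", "harness-timeout"]
records = [run_trial(f, stub_inject, stub_agent, stub_score) for f in faults]
kept = usable(records)
```

Keeping the error source on each record is one way to ensure that a harness failure is excluded from the agent's score instead of being counted as a reasoning error.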
