Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures
This work exposes a critical security flaw in deploying autonomous AI agents: undetectable errors undermine runtime safety assumptions, with implications for model governance and reliability.
The study finds that instruction-tuned language models exhibit silent commitment failure, in which errors are undetectable before output. Two of three instruction-following models showed zero warning signal, while the third produced a detectable conflict signal 57 tokens before commitment. Governability varied significantly across architectures.
As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate that it varies dramatically across models. Across six models and twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show that benchmark accuracy does not predict governability, that correction capacity varies independently of detection, and that identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only ±0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
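The four regimes of the Detection and Correction Matrix can be read as the cross product of two yes/no questions: is the error detectable before output commitment, and is it correctable once detected? The following is a minimal Python sketch of that reading. The assignment of regime names to cells is an assumption inferred from the names themselves, since the abstract does not spell it out, and `Regime` and `classify_regime` are hypothetical names introduced here for illustration.

```python
from enum import Enum


class Regime(Enum):
    """The four regimes named in the abstract."""
    GOVERNABLE = "Governable"
    MONITOR_ONLY = "Monitor Only"
    STEER_BLIND = "Steer Blind"
    UNGOVERNABLE = "Ungovernable"


def classify_regime(detectable: bool, correctable: bool) -> Regime:
    """Map a model-task combination onto one cell of the 2x2 matrix.

    Assumed mapping (inferred from the regime names, not stated in the source):
      detectable and correctable      -> Governable
      detectable, not correctable     -> Monitor Only
      not detectable, correctable     -> Steer Blind
      neither                         -> Ungovernable
    """
    if detectable and correctable:
        return Regime.GOVERNABLE
    if detectable:
        return Regime.MONITOR_ONLY
    if correctable:
        return Regime.STEER_BLIND
    return Regime.UNGOVERNABLE


if __name__ == "__main__":
    # Enumerate all four cells of the matrix.
    for detectable in (True, False):
        for correctable in (True, False):
            regime = classify_regime(detectable, correctable)
            print(f"detectable={detectable}, correctable={correctable}: {regime.value}")
```

Under this reading, a model that fails silently but responds to steering would fall in Steer Blind, while one that emits a conflict signal but cannot be corrected would fall in Monitor Only. In practice the two inputs would come from thresholded measurements rather than booleans supplied by hand.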