The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible

Lauri Lovén, Nam Do, Hassan Mehmood, Dinesh Kumar Sah, Sasu Tarkoma

arXiv:2605.25739


AI Analysis

This work identifies a fundamental impossibility for AI systems that must balance helpfulness, calibration, and autonomy, a result directly relevant to developers of trustworthy autonomous agents.

The paper proves the Behavioral Credibility Trilemma: no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy when some tasks exceed the agent's competence, because adding a non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness. Experiments confirm the predicted confidence inflation (effect sizes d = 1.10 to 5.32) and show a plateau-truncated frontier.

We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric: adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $\Omega(1/\Delta^2)$ observations. We prove the principal's optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.
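
The $w_A/(2 w_C)$ scaling can be checked numerically with a toy model. The sketch below is not the paper's construction: it assumes a single binary task with true success probability p, a Brier-score calibration term weighted by w_c, and a unit-slope bonus w_a * q standing in for a local linearization of the autonomy incentive. The function names, the grid search, and the Hoeffding-based sample count are all illustrative assumptions. The Hoeffding count is a sufficient sample size with the same 1/Δ^2 scaling, not the paper's lower bound.

```python
import math

def expected_reward(q, p, w_c, w_a):
    """Expected reward for reporting confidence q when the true success
    probability is p. The linear bonus w_a * q is an illustrative
    assumption, not the incentive used in the paper."""
    brier_loss = p * (1 - q) ** 2 + (1 - p) * q ** 2
    return -w_c * brier_loss + w_a * q

def best_report(p, w_c, w_a, grid=10001):
    """Grid-search the report q in [0, 1] that maximizes expected reward."""
    qs = [i / (grid - 1) for i in range(grid)]
    return max(qs, key=lambda q: expected_reward(q, p, w_c, w_a))

def samples_to_detect(delta, alpha=0.05):
    """Hoeffding-style number of [0, 1]-bounded observations sufficient to
    estimate a mean to within delta with probability at least 1 - alpha."""
    return math.ceil(math.log(2 / alpha) / (2 * delta ** 2))

if __name__ == "__main__":
    p, w_c, w_a = 0.4, 1.0, 0.2
    q_star = best_report(p, w_c, w_a)
    print("honest report:   ", best_report(p, w_c, 0.0))
    print("inflated report: ", q_star)
    print("predicted shift: ", w_a / (2 * w_c))
    print("samples for that shift:", samples_to_detect(q_star - p))
```

Under these assumptions the first-order condition $-2 w_C (q - p) + w_A = 0$ gives $q^* = p + w_A/(2 w_C)$, clipped to 1 when p is already high. The clipping is a feature of this toy model only, and its relation to the paper's inflation saturation is not established here.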
