Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
The work gives practitioners a method for deciding whether to trust an AI system in deployment. The contribution is incremental, since it builds on existing sampling and calibration techniques.
The paper addresses how to certify the reliability of a black-box AI system on a specific task. It introduces a reliability level derived from self-consistency sampling and conformal calibration. Reported reliability levels are 94.6% on GSM8K and 96.8% on TruthfulQA for GPT-4.1, and 89.8% on GSM8K and 66.5% on MMLU for GPT-4.1-nano.
Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors, which are made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
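The two ingredients named in the abstract can be illustrated with a minimal sketch of standard split conformal prediction over self-consistency frequencies. This is not the paper's implementation: the nonconformity score (one minus the sampled frequency of an answer) and the function names `answer_frequencies`, `conformal_threshold`, and `prediction_set` are assumptions made for illustration, and the calibration scores in the usage lines are toy values, not results from the paper.

```python
import math
from collections import Counter


def answer_frequencies(samples):
    """Self-consistency step: empirical frequency of each distinct sampled answer."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


def conformal_threshold(cal_scores, alpha):
    """Split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Each calibration score is assumed to be 1 - frequency of the correct
    answer among that item's samples (1.0 if it was never sampled).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration items for this alpha: every answer must be kept.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(samples, qhat):
    """Keep every sampled answer whose score 1 - frequency is at most qhat."""
    freqs = answer_frequencies(samples)
    return {answer for answer, f in freqs.items() if 1.0 - f <= qhat}


# Toy calibration scores (hypothetical), n = 10, target level 1 - alpha = 0.75.
cal_scores = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 1.0]
qhat = conformal_threshold(cal_scores, alpha=0.25)

easy = ["42"] * 6 + ["41"] * 3 + ["40"]  # samples agree: small answer set
hard = ["a"] * 5 + ["b"] * 5             # samples split: larger answer set
print(qhat, prediction_set(easy, qhat), prediction_set(hard, qhat))
```

Under exchangeability of calibration and test items, and with no ties among scores, this construction covers the correct answer with probability at least 1 - alpha and at most 1 - alpha + 1/(n+1), which matches the 1/(n+1) tolerance quoted in the abstract. In the toy run, the question with split samples receives the larger answer set, which is the behavior the abstract describes for harder questions.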