CRAICYLGMLMay 29, 2025

Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention

arXiv:2505.23968v12 citationsh-index: 31ICML
Originality Highly original
AI Analysis

This addresses a novel security problem in safety-critical ML applications where abstention can be abused for discrimination, offering a solution to ensure fairness and trust.

The paper tackles the threat of dishonest institutions exploiting model abstention mechanisms to discriminate by introducing Mirage, an attack that reduces confidence in targeted inputs while maintaining overall performance, and proposes Confidential Guardian, a framework that detects artificially suppressed confidence and uses zero-knowledge proofs to verify inference, effectively preventing misuse with verifiable assurances.

Cautious predictions -- where a machine learning model abstains when uncertain -- are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model's proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes