ML | LG | Jan 17, 2025

Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

arXiv:2501.10202v1 | 2 citations | h-index: 1 | ICLR
Originality: Highly original
AI Analysis

The paper addresses model robustness, a critical issue for AI safety, by providing a provable defense mechanism, though it builds incrementally on extreme value theory for detection.

The paper tackles the problem of protecting classifiers from out-of-distribution (OOD) and adversarial samples. It introduces SPADE, a method that turns a classifier into an abstaining one with provable safeguards, and reports efficient, stable performance across several neural architectures on CIFAR-10, CIFAR-100, and ImageNet.

This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a Generalized Extreme Value (GEV) model of the training distribution in the classifier's latent space, enabling the formal characterization of OOD samples. Interestingly, under mild assumptions, the GEV model also allows for formally characterizing adversarial samples. The abstaining classifier, which rejects samples based on their assessment by the GEV model, provably avoids OOD and adversarial samples. The empirical validation of the approach, conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and medium and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet), demonstrates its frugality, stability, and efficiency compared to the state of the art.
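
The abstain-on-extreme-score idea can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, purely for illustration, that the score is the Euclidean distance to a single centroid in latent space, that block maxima of training scores follow a Gumbel law (the GEV family with shape parameter 0), and that a method-of-moments fit is good enough. The names `fit_gumbel`, `make_abstaining`, `block_size`, and `alpha` are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def block_maxima(values, block_size):
    """Split values into consecutive full blocks and keep each block's maximum."""
    stop = len(values) - block_size + 1
    return [max(values[i:i + block_size]) for i in range(0, stop, block_size)]

def fit_gumbel(maxima):
    """Method-of-moments fit of a Gumbel law (GEV with shape 0; an assumption)."""
    scale = statistics.stdev(maxima) * math.sqrt(6) / math.pi
    loc = statistics.fmean(maxima) - EULER_GAMMA * scale
    return loc, scale

def gumbel_sf(x, loc, scale):
    """P(M > x) for a block maximum M ~ Gumbel(loc, scale)."""
    return -math.expm1(-math.exp(-(x - loc) / scale))

def make_abstaining(classify, centroid, train_latents, block_size=50, alpha=0.01):
    """Wrap `classify` so that it returns None (abstains) on extreme latents."""
    dists = [math.dist(z, centroid) for z in train_latents]
    loc, scale = fit_gumbel(block_maxima(dists, block_size))

    def guarded(z):
        score = math.dist(z, centroid)
        # Abstain when even the maximum of a training block would
        # rarely reach this distance under the fitted model.
        if gumbel_sf(score, loc, scale) < alpha:
            return None
        return classify(z)

    return guarded

# Toy usage: 2-D Gaussian "latents" around the origin, constant classifier.
random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]
guard = make_abstaining(lambda z: "cat", (0.0, 0.0), train)
near = guard((0.1, 0.1))    # close to the training mass: classified
far = guard((50.0, 50.0))   # far outside the training mass: abstains
```

In a real setting the latents would come from the classifier's penultimate layer, and a full GEV fit (including the shape parameter) would replace the Gumbel shortcut. The provable guarantees claimed in the abstract rely on the paper's own assumptions, which this sketch does not reproduce.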

