Reject Illegal Inputs with Generative Classifier Derived from Any Discriminative Classifier
This work addresses security and reliability issues in machine learning systems by improving the detection of malicious or anomalous inputs, though the contribution is incremental because it builds on the existing SDIM framework.
The paper tackles the detection of illegal inputs, such as adversarial examples and out-of-distribution samples, by proposing SDIM-logit, a method that derives a generative classifier from the logits of any discriminative classifier. When a portion of test samples is rejected, performance on the remaining test samples improves significantly.
Generative classifiers have shown promise in detecting illegal inputs, including adversarial examples and out-of-distribution samples. Supervised Deep Infomax~(SDIM) is a scalable end-to-end framework for learning generative classifiers. In this paper, we propose a modification of SDIM termed SDIM-\emph{logit}. Instead of training a generative classifier from scratch, SDIM-\emph{logit} first takes as input the logits produced by any given discriminative classifier and generates logit representations; a generative classifier is then derived by imposing statistical constraints on these logit representations. SDIM-\emph{logit} can inherit the performance of the discriminative classifier without loss. It adds only a negligible number of parameters and can be trained efficiently with the base classifier fixed. We perform \emph{classification with rejection}, where test samples whose class conditionals are smaller than pre-chosen thresholds are rejected without predictions. Experiments on illegal inputs, including adversarial examples, samples with common corruptions, and out-of-distribution~(OOD) samples, show that, when allowed to reject a portion of test samples, SDIM-\emph{logit} significantly improves performance on the remaining test samples.
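
The rejection rule described above can be illustrated with a minimal sketch. It is not the SDIM-\emph{logit} model itself: it assumes the class-conditional log-likelihoods have already been computed and are given as a NumPy array of shape (samples, classes). The function names, the use of -1 as a "rejected" marker, and the choice of per-class thresholds as a low percentile of training-set scores are all assumptions made for illustration.

```python
import numpy as np


def choose_thresholds(train_log_cond, train_labels, percentile=1.0):
    """Hypothetical helper: one threshold per class.

    Assumption: the threshold for class c is a low percentile of
    log p(x | y=c) over the training samples whose label is c.
    """
    num_classes = train_log_cond.shape[1]
    thresholds = np.empty(num_classes)
    for c in range(num_classes):
        scores = train_log_cond[train_labels == c, c]
        thresholds[c] = np.percentile(scores, percentile)
    return thresholds


def classify_with_rejection(log_cond, thresholds):
    """Predict the class with the largest class conditional, or -1 to reject.

    A sample is rejected when the class conditional of its predicted
    class is smaller than that class's pre-chosen threshold.
    """
    preds = np.argmax(log_cond, axis=1)
    top_scores = log_cond[np.arange(len(preds)), preds]
    rejected = top_scores < thresholds[preds]
    return np.where(rejected, -1, preds)


# Toy usage: three test samples, two classes, hand-picked thresholds.
example_log_cond = np.array([[-1.0, -5.0], [-9.0, -8.0], [-6.0, -2.0]])
example_thresholds = np.array([-3.0, -4.0])
example_preds = classify_with_rejection(example_log_cond, example_thresholds)
```

In this toy example the second sample is rejected, because its best class conditional (-8.0) falls below the threshold of the predicted class (-4.0). Accuracy would then be measured only on the samples that were not rejected.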