Utilizing Network Properties to Detect Erroneous Inputs
The work addresses an important problem for AI safety and reliability by providing a lightweight detection method, though the contribution is incremental because it builds on existing feature-analysis techniques.
The paper tackles the problem of detecting erroneous inputs to neural networks, such as adversarial and out-of-distribution data, by training a linear SVM classifier on the networks' activation features, which makes it possible to reject bad inputs with no extra training or overhead.
Neural networks are vulnerable to a wide range of erroneous inputs such as adversarial, corrupted, out-of-distribution, and misclassified examples. In this work, we train a linear SVM classifier to detect these four types of erroneous data using hidden and softmax feature vectors of pre-trained neural networks. Our results indicate that these faulty data types generally exhibit activation properties that are linearly separable from those of correct examples, giving us the ability to reject bad inputs with no extra training or overhead. We experimentally validate our findings across a diverse range of datasets, domains, pre-trained models, and adversarial attacks.
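The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors are synthetic Gaussian clusters standing in for hidden or softmax activations of a pre-trained network, the hand-written subgradient trainer stands in for whichever linear SVM solver the paper used, and every name (`train_linear_svm`, `decision`, `should_reject`, `make_features`) is hypothetical.

```python
import random


def train_linear_svm(X, y, epochs=20, lr=0.1, lam=0.001, seed=0):
    """Fit (w, b) by stochastic subgradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularisation term shrinks w on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def decision(w, b, x):
    """Signed score of x relative to the separating hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def make_features(n, centre, dim, rng):
    """Synthetic stand-in for activation vectors (an assumption,
    not real network features)."""
    return [[rng.gauss(centre, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8
correct = make_features(200, 2.0, DIM, rng)     # label +1: correct inputs
erroneous = make_features(200, -2.0, DIM, rng)  # label -1: erroneous inputs
X = correct + erroneous
y = [1] * 200 + [-1] * 200

# Hold out a quarter of the examples to check the detector.
idx = list(range(len(X)))
rng.shuffle(idx)
train, test = idx[:300], idx[300:]
w, b = train_linear_svm([X[i] for i in train], [y[i] for i in train])


def should_reject(x):
    """Flag an input whose features fall on the 'erroneous' side."""
    return decision(w, b, x) < 0.0


accuracy = sum(should_reject(X[i]) == (y[i] == -1) for i in test) / len(test)
print(f"held-out detection accuracy: {accuracy:.2f}")
```

In a real setting, the rows of `X` would be activation vectors extracted from the frozen network for known-correct and known-erroneous inputs, and `should_reject` would be applied to the features of each new input at inference time. The synthetic clusters here are deliberately well separated, so the sketch only demonstrates the mechanics and says nothing about how separable real activations are.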