LG · Feb 22, 2025

Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric

arXiv:2502.16329v2
Originality: Incremental advance
AI Analysis

The paper addresses the reliability of AI systems by providing a standardized approach for comparing new data to a model's training data. The advance is incremental because it builds on existing out-of-distribution detection methods.

The paper tackles the failure of machine learning models to generalize to new data by proposing SAGE, a model-agnostic method for assessing data similarity. After SAGE score filtering, out-of-the-box model performance improves on MNIST, CIFAR-10, and UCI Abalone.

Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems, partly due to the lack of simple and robust methods for comparing new data to the original training dataset. We propose a standardized approach for assessing data similarity in a model-agnostic manner by constructing a supervised autoencoder for generalizability estimation (SAGE). We compare points in a low-dimensional embedded latent space, defining empirical probability measures for k-Nearest Neighbors (kNN) distance, reconstruction of inputs and task-based performance. As proof of concept for classification tasks, we use MNIST and CIFAR-10 to demonstrate how an ensemble output probability score can separate deformed images from a mixture of typical test examples, and how this SAGE score is robust to transformations of increasing severity. As further proof of concept, we extend this approach to a regression task using non-imaging data (UCI Abalone). In all cases, we show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets. Our out-of-distribution scoring method can be introduced during several steps of model construction and assessment, leading to future improvements in responsible deep learning implementation.
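To make the scoring idea in the abstract concrete, here is a minimal sketch of an ensemble similarity score built from two of the three signals it names: kNN distance in a latent space and reconstruction error. It is not the authors' implementation. Several parts are assumptions made for illustration: a PCA projection stands in for the supervised autoencoder, the task-based performance term is omitted, the two empirical probabilities are combined by a plain average, and all function names and the defaults `dim=2`, `k=5` are hypothetical.

```python
import numpy as np

def fit_linear_encoder(X, dim):
    """PCA stand-in for the autoencoder (assumption, not the paper's network)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:dim]

def encode(X, mean, components):
    return (X - mean) @ components.T

def decode(Z, mean, components):
    return Z @ components + mean

def knn_distance(Z_query, Z_ref, k, exclude_self=False):
    """Mean Euclidean distance from each query point to its k nearest reference points."""
    d = np.linalg.norm(Z_query[:, None, :] - Z_ref[None, :, :], axis=2)
    d.sort(axis=1)
    # When scoring the reference set against itself, skip the zero self-distance.
    start = 1 if exclude_self else 0
    return d[:, start:start + k].mean(axis=1)

def empirical_tail_prob(ref_values, query_values):
    """Fraction of reference values that are >= each query value."""
    ref_sorted = np.sort(ref_values)
    idx = np.searchsorted(ref_sorted, query_values, side="left")
    return (len(ref_sorted) - idx) / len(ref_sorted)

def similarity_score(X_train, X_new, dim=2, k=5):
    """Score in [0, 1]; higher means more similar to the training data."""
    mean, comps = fit_linear_encoder(X_train, dim)
    Z_train = encode(X_train, mean, comps)
    Z_new = encode(X_new, mean, comps)

    # Reference distributions measured on the training data itself.
    knn_train = knn_distance(Z_train, Z_train, k, exclude_self=True)
    rec_train = np.linalg.norm(X_train - decode(Z_train, mean, comps), axis=1)

    # The same quantities for the new points.
    knn_new = knn_distance(Z_new, Z_train, k)
    rec_new = np.linalg.norm(X_new - decode(Z_new, mean, comps), axis=1)

    p_knn = empirical_tail_prob(knn_train, knn_new)
    p_rec = empirical_tail_prob(rec_train, rec_new)
    # Plain average of the two probabilities; the combination rule is an assumption.
    return (p_knn + p_rec) / 2
```

Filtering in the sense the abstract describes would then amount to keeping only points whose score exceeds a chosen threshold, for example `X_new[similarity_score(X_train, X_new) > 0.1]`, where the threshold value is likewise an illustrative choice.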
