LG MLJun 1, 2021

Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics

arXiv:2106.00734v215.523 citations

Originality Incremental advance

AI Analysis

This work addresses the problem of interpreting generalization metrics for researchers in deep learning, but it is incremental as it builds on existing HT-SR theory.

The paper analyzed a contest dataset to understand generalization in neural networks, revealing a Simpson's paradox where scale metrics perform well overall but poorly on subpartitions, while shape metrics show the opposite pattern, clarifying why the ALPHAHAT metric works well across models.

To understand better good generalization performance in state-of-the-art neural network (NN) models, and in particular the success of the ALPHAHAT metric based on Heavy-Tailed Self-Regularization (HT-SR) theory, we analyze of a corpus of models that was made publicly-available for a contest to predict the generalization accuracy of NNs. These models include a wide range of qualities and were trained with a range of architectures and regularization hyperparameters. We break ALPHAHAT into its two subcomponent metrics: a scale-based metric; and a shape-based metric. We identify what amounts to a Simpson's paradox: where "scale" metrics (from traditional statistical learning theory) perform well in aggregate, but can perform poorly on subpartitions of the data of a given depth, when regularization hyperparameters are varied; and where "shape" metrics (from HT-SR theory) perform well on each subpartition of the data, when hyperparameters are varied for models of a given depth, but can perform poorly overall when models with varying depths are aggregated. Our results highlight the subtlety of comparing models when both architectures and hyperparameters are varied; the complementary role of implicit scale versus implicit shape parameters in understanding NN model quality; and the need to go beyond one-size-fits-all metrics based on upper bounds from generalization theory to describe the performance of NN models. Our results also clarify further why the ALPHAHAT metric from HT-SR theory works so well at predicting generalization across a broad range of CV and NLP models.

View on arXiv PDF

Similar