LG STAT-MECH | Mar 17, 2025

High-entropy Advantage in Neural Networks' Generalizability

Entao Yang, Xiaotian Zhang, Yue Shang, Ge Zhang

arXiv:2503.13145v2 | 2 citations | h-index: 1

Originality: Incremental advance

AI Analysis

The paper provides a thermodynamic explanation for generalization, which could inform the choice of training optimizers for networks of different sizes. It is incremental in that it applies existing physics concepts to ML.

The authors address the problem of understanding neural network generalization by introducing Boltzmann entropy and modeling networks as molecular systems. On networks with up to 1 million parameters, they find that high-entropy states generally outperform states reached by conventional training methods, across tasks such as image recognition and language modeling.

One of the central challenges in modern machine learning is understanding how neural networks generalize knowledge learned from training data to unseen test data. While numerous empirical techniques have been proposed to improve generalization, a theoretical understanding of the mechanism of generalization remains elusive. Here we introduce the concept of Boltzmann entropy into neural networks by re-conceptualizing such networks as hypothetical molecular systems where weights and biases are atomic coordinates, and the loss function is the potential energy. By employing molecular simulation algorithms, we compute entropy landscapes as functions of both training loss and test accuracy (or test loss), on networks with up to 1 million parameters, across four distinct machine learning tasks: arithmetic questions, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of a high-entropy advantage, wherein high-entropy network states generally outperform those reached via conventional training techniques like stochastic gradient descent. This entropy advantage provides a thermodynamic explanation for neural network generalizability: at low training loss, the generalizable states occupy a larger part of the parameter space than their non-generalizable analogs. Furthermore, we find this advantage more pronounced in narrower neural networks, indicating a need for different training optimizers tailored to different sizes of networks.
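
To make the molecular analogy concrete, here is a minimal sketch of sampling a model's parameters the way a molecular simulation samples atomic coordinates. Everything in it is an assumption for illustration: the abstract does not say which molecular simulation algorithms the authors use, so plain Metropolis Monte Carlo stands in for them, and the toy data, the two-parameter linear model, and the names `DATA`, `loss`, and `metropolis` are hypothetical. The sketch samples parameters with probability proportional to exp(-loss / temperature). It does not compute an entropy landscape.

```python
import math
import random

# Hypothetical toy data: five points on the line y = 2x + 1.
DATA = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def loss(params):
    """Mean squared error of y = w*x + b; plays the role of potential energy."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in DATA) / len(DATA)


def metropolis(n_steps, temperature, step_size=0.1, seed=0):
    """Sample parameters with probability proportional to exp(-loss / temperature)."""
    rng = random.Random(seed)
    params = [0.0, 0.0]  # "atomic coordinates": one weight and one bias
    energy = loss(params)
    samples = []
    for _ in range(n_steps):
        # Propose a small random displacement of every coordinate.
        trial = [p + rng.uniform(-step_size, step_size) for p in params]
        trial_energy = loss(trial)
        delta = trial_energy - energy
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            params, energy = trial, trial_energy
        samples.append((list(params), energy))
    return samples


def mean_loss(temperature, n_steps=20000, burn_in=5000):
    """Average sampled loss after discarding an initial burn-in period."""
    run = metropolis(n_steps=n_steps, temperature=temperature)
    tail = [energy for _, energy in run[burn_in:]]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    for temperature in (0.01, 1.0):
        print(f"T={temperature}: mean loss = {mean_loss(temperature):.4f}")
```

Running the script prints the mean sampled loss at a low and a high temperature; the higher temperature visits higher-loss states on average. Estimating entropy as a function of loss, as the abstract describes, would additionally require counting how much parameter-space volume sits at each loss value, which this sketch leaves out.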
