Separation of Scales and a Thermodynamic Description of Feature Learning in Some CNNs
This work provides a foundational thermodynamic description for understanding feature learning in DNNs; it is incremental in the sense that it builds on existing infinite-width theories.
The paper tackles the challenge of analyzing deep neural networks (DNNs) by identifying a separation of scales in trained convolutional and fully connected networks. It shows that layers couple only through the second moments (kernels) of their activations and pre-activations, which leads to a thermodynamic theory that yields accurate predictions in various settings.
Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent parameters, render direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow variables that average the erratic behavior of the fast microscopic variables. Here, we identify a similar separation of scales occurring in fully trained finitely over-parameterized deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only through the second moment (kernels) of their activations and pre-activations. Moreover, the latter fluctuates in a nearly Gaussian manner. For infinite width DNNs, these kernels are inert, while for finite ones they adapt to the data and yield a tractable data-aware Gaussian Process. The resulting thermodynamic theory of deep learning yields accurate predictions in various settings. In addition, it provides new ways of analyzing and understanding DNNs in general.
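As a rough illustration of what "kernel" means here, the sketch below computes the second moment of the pre-activations of a single random linear layer and compares it with its infinite-width limit. It is not the paper's method: the weights are untrained Gaussian draws, and the function names, the 1/d weight variance, and the toy data are assumptions made for this example. It only shows the infinite-width side of the abstract's claim, namely that the kernel stops fluctuating as width grows; the data-dependent adaptation at finite width requires training and is not reproduced here.

```python
import numpy as np

def empirical_kernel(X, width, rng):
    """Second moment of pre-activations for one random linear layer.

    X has shape (n, d). Weights are drawn i.i.d. from N(0, 1/d)
    (an assumed scaling). Returns the (n, n) kernel averaged over
    the `width` output channels.
    """
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, width))
    H = X @ W                      # pre-activations, shape (n, width)
    return H @ H.T / width

def infinite_width_kernel(X):
    """Expected kernel under the same weight distribution: X X^T / d."""
    return X @ X.T / X.shape[1]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))       # toy inputs: 5 samples, 20 features
K_inf = infinite_width_kernel(X)

# Largest entrywise deviation from the infinite-width kernel, per width.
deviations = {}
for width in (10, 100, 10000):
    K = empirical_kernel(X, width, rng)
    deviations[width] = float(np.max(np.abs(K - K_inf)))

for width, dev in deviations.items():
    print(f"width={width:>6}  max |K - K_inf| = {dev:.4f}")
```

The deviation should shrink roughly as one over the square root of the width, which is the sense in which the kernel behaves as a slow, nearly deterministic variable in wide networks.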