LG HEP-PH HEP-TH ML | Oct 11, 2023

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Hannah Day, Yonatan Kahn, Daniel A. Roberts

arXiv:2310.07765v2 | 9.85 citations | h-index: 17

Originality: Incremental advance

AI Analysis

This paper addresses training and generalization issues in deep networks whose depth is comparable to their width. It offers a potential improvement over standard Gaussian initialization, though the advance appears incremental because it builds on existing initialization techniques.

The paper tackles the problem of signal fluctuations in deep neural networks by using orthogonal weight initializations. It shows analytically that preactivation fluctuations become independent of depth to leading order in inverse width, and numerically that the NTK-related correlators governing training saturate at a depth of ~20, whereas with Gaussian initializations they grow without bound. It provides experimental justification through improved performance on the MNIST and CIFAR-10 classification tasks.

Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed in deep networks with depth comparable to width. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.
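The contrast between the two initializations can be explored with a small numerical experiment. The following is a minimal sketch, not the authors' code: it assumes square weight matrices, zero biases, a fixed input with squared norm equal to the width, and unit weight variance scaling for the Gaussian case. The helper names (`haar_orthogonal`, `gaussian`, `final_kernel`, `kernel_variance`) are hypothetical. It measures how much the normalized squared norm of the last-layer preactivation varies across random draws of a tanh network.

```python
import numpy as np


def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # QR is only unique up to column signs; fixing them with the signs of
    # diag(r) makes the resulting distribution uniform over the orthogonal group.
    return q * np.sign(np.diag(r))


def gaussian(n, rng):
    """Draw an n x n matrix with i.i.d. N(0, 1/n) entries (assumed scaling)."""
    return rng.standard_normal((n, n)) / np.sqrt(n)


def final_kernel(sampler, x, depth, rng):
    """Return (1/n)|z|^2 for the last-layer preactivation z of one random net."""
    n = x.shape[0]
    z = sampler(n, rng) @ x  # first layer acts on the raw input
    for _ in range(depth - 1):
        z = sampler(n, rng) @ np.tanh(z)  # later layers act on tanh activations
    return z @ z / n


def kernel_variance(sampler, width, depth, n_samples=200, seed=0):
    """Variance of (1/n)|z|^2 over independent draws of the network weights."""
    rng = np.random.default_rng(seed)
    x = np.ones(width)  # fixed input with |x|^2 = width (an assumption)
    ks = [final_kernel(sampler, x, depth, rng) for _ in range(n_samples)]
    return float(np.var(ks))


if __name__ == "__main__":
    for depth in (1, 5, 10, 20, 40):
        v_orth = kernel_variance(haar_orthogonal, width=32, depth=depth)
        v_gauss = kernel_variance(gaussian, width=32, depth=depth)
        print(f"depth={depth:3d}  orthogonal={v_orth:.3e}  gaussian={v_gauss:.3e}")
```

Because an orthogonal matrix preserves norms exactly, the first-layer quantity has no fluctuation at all in the orthogonal case, while the Gaussian case already fluctuates at depth 1. Note that this raw variance is only a rough proxy for the correlators analyzed in the paper, so the sketch illustrates the setup rather than reproducing the paper's results.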
