LG | Feb 18, 2022

On the Implicit Bias Towards Minimal Depth of Deep Neural Networks

arXiv:2202.09028v9 | 16 citations
Originality: Incremental advance
AI Analysis

This work examines optimization biases in deep learning, offering insights into neural collapse and generalization. It is an incremental advance that builds on the existing neural collapse literature.

The study investigates the implicit bias of stochastic gradient descent (SGD) towards selecting neural networks with minimal effective depth, defined as the first layer where embeddings become separable, and shows empirically that SGD favors small effective depths. It also links the degree of separability in intermediate layers to generalization, deriving a bound that provides non-trivial test performance estimates and demonstrating that effective depth increases with more random labels in data.

Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer for which sample embeddings are separable using the nearest-class center classifier. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Second, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimates of the test performance. Finally, we empirically show that the effective depth of a trained neural network monotonically increases when increasing the number of random labels in data.
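The notion of effective depth described above can be illustrated with a minimal sketch. It assumes the per-layer embeddings have already been extracted as NumPy arrays; the function names `ncc_accuracy` and `effective_depth`, the tolerance parameter `tol`, and the 1-based layer indexing are illustrative assumptions, not the paper's code or exact definition.

```python
import numpy as np

def ncc_accuracy(features, labels):
    """Accuracy of a nearest class-center classifier fitted and evaluated on the same samples."""
    classes = np.unique(labels)
    # One center per class: the mean embedding of that class.
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every sample to every class center.
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == labels).mean())

def effective_depth(layer_features, labels, tol=0.0):
    """First layer (1-based, an assumed convention) whose NCC error is at most tol, else None."""
    for depth, feats in enumerate(layer_features, start=1):
        if 1.0 - ncc_accuracy(feats, labels) <= tol:
            return depth
    return None

# Toy data (hypothetical): two classes, embeddings from two "layers".
labels = np.array([0, 0, 1, 1])
layer1 = np.array([[0.0], [2.0], [1.0], [3.0]])  # classes interleaved, not NCC-separable
layer2 = np.array([[0.0], [1.0], [4.0], [5.0]])  # classes well separated
depth = effective_depth([layer1, layer2], labels)
```

In this toy case the first layer misclassifies half of the samples under the nearest class-center rule, while the second separates them, so the sketch reports an effective depth of 2. On a real network one would pass the embeddings of each layer on the training set, and would probably use a small positive `tol` rather than exact separability.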
