LG | ML | Dec 31, 2020

Explicit regularization and implicit bias in deep network classifiers trained with the square loss

arXiv:2101.00072v1 | 45 citations
Originality: Incremental advance
AI Analysis

This work gives a theoretical account of explicit regularization and implicit bias in deep network classifiers trained with the square loss, which is relevant to researchers and practitioners who want to understand and improve such classifiers.

The paper gives a theoretical justification, based on gradient flow analysis, for the good classification performance of deep ReLU networks trained with the square loss. It shows that combining a normalization technique such as Batch Normalization (BN) or Weight Normalization (WN) with Weight Decay (WD) is expected to lead to convergence to a minimum norm solution. Among close-to-interpolating solutions, those with smaller Frobenius norms of the unnormalized weight matrices have better margin and better bounds on the expected classification error.

Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks. We provide here a theoretical justification based on analysis of the associated gradient flow. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or Weight Normalization (WN) are used together with Weight Decay (WD). The main property of the minimizers that bounds their expected error is the norm: we prove that among all the close-to-interpolating solutions, the ones associated with smaller Frobenius norms of the unnormalized weight matrices have better margin and better bounds on the expected classification error. With BN but in the absence of WD, the dynamical system is singular. Implicit dynamical regularization -- that is zero-initial conditions biasing the dynamics towards high margin solutions -- is also possible in the no-BN and no-WD case. The theory yields several predictions, including the role of BN and weight decay, aspects of Papyan, Han and Donoho's Neural Collapse and the constraints induced by BN on the network weights.
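The link between weight norms and margin described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's code: the bias-free architecture, the layer sizes, the random weights, and the helper names `relu_net` and `normalize` are all assumptions made for illustration. It relies only on the positive homogeneity of ReLU: for a bias-free ReLU network, rescaling each weight matrix to unit Frobenius norm rescales the output by the product of the norms, here called `rho`.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free ReLU network: W_L relu(W_{L-1} ... relu(W_1 x))."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def normalize(weights):
    """Return unit-Frobenius-norm matrices V_k and rho = prod_k ||W_k||_F."""
    norms = [np.linalg.norm(W) for W in weights]  # Frobenius norm for 2-D arrays
    V = [W / n for W, n in zip(weights, norms)]
    return V, float(np.prod(norms))

# Hypothetical toy setup: random weights and inputs, not taken from the paper.
rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((16, 5)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((1, 16)),
]
X = rng.standard_normal((5, 20))  # 20 inputs stored as columns

V, rho = normalize(weights)
f_unnorm = relu_net(weights, X)   # output of the unnormalized network
f_norm = relu_net(V, X)           # output of the normalized network

# Homogeneity check: f(W; x) should equal rho * f(V; x) up to rounding.
homogeneity_gap = float(np.max(np.abs(f_unnorm - rho * f_norm)))

# Labels chosen as the sign of the output, so no point is misclassified.
y = np.sign(f_unnorm)
normalized_margin = float(np.min(y * f_norm))
```

Under these assumptions, if a solution is close to interpolating, so that `y * rho * f_norm` is near 1 on every training point, the normalized margin is roughly `1 / rho`. A smaller product of Frobenius norms then corresponds to a larger margin, which is the direction of the claim in the abstract.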
