LG · AI · ML · Jan 7, 2019

Generalization in Deep Networks: The Role of Distance from Initialization

arXiv:1901.01672v2 · 97 citations
AI Analysis

This work addresses a fundamental problem in deep learning theory, relevant to both researchers and practitioners: why heavily parameterized networks generalize. Its contribution is incremental, since it builds on existing ideas about implicit regularization.

The paper investigates why deep neural networks trained with stochastic gradient descent (SGD) generalize well despite having many parameters, proposing that effective model capacity depends on random initialization and is restricted by implicit regularization of the ℓ₂ distance from initialization. It provides empirical evidence and theoretical arguments to support this notion, leaving open questions about the mechanisms and sufficiency of this regularization.

Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that is dependent on a given random initialization of the network and not just the training algorithm and the data distribution. We provide empirical evidence demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the ℓ₂ distance from the initialization. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
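
To make the central quantity concrete, the ℓ₂ distance from initialization can be computed by treating all parameters as one flattened vector and taking the norm of the difference between the trained and initial values. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the function name `l2_distance_from_init`, the toy linear model, the learning rate, and the step count are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_distance_from_init(params, init_params):
    """Return ||w - w0||_2 over all parameter arrays, flattened together.

    `params` and `init_params` are lists of arrays with matching shapes
    (hypothetical interface, chosen for this illustration only).
    """
    squared = 0.0
    for p, p0 in zip(params, init_params):
        diff = np.asarray(p, dtype=float) - np.asarray(p0, dtype=float)
        squared += float(np.sum(diff ** 2))
    return float(np.sqrt(squared))


# Toy stand-in for a network: linear regression trained with plain SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))          # 64 examples, 5 features
y = X @ rng.normal(size=5)            # noiseless linear targets

w0 = 0.1 * rng.normal(size=5)         # random initialization
w = w0.copy()

for step in range(200):
    i = rng.integers(0, X.shape[0])   # sample one training example
    grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * squared error
    w -= 0.01 * grad                  # SGD update

dist = l2_distance_from_init([w], [w0])
print(f"l2 distance from initialization: {dist:.4f}")
```

For a real network, the same function would be applied to the list of weight arrays saved at initialization and after training. Tracking this value as the width grows is one way to probe the kind of behavior the abstract describes.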
