LG · ML · May 31, 2019

Luck Matters: Understanding Training Dynamics of Deep ReLU Networks

arXiv:1905.13405v4 · 33 citations
Originality: Incremental advance
AI Analysis

The paper offers insight into deep learning phenomena such as over-parameterization and implicit regularization, but the advance is incremental because it builds on existing teacher-student frameworks.

The paper studies the training dynamics of deep ReLU networks and how they affect generalization. It shows that student nodes initialized close to teacher nodes converge faster and that, in the over-parameterized 2-layer case, only a small set of 'lucky' nodes converge to teacher nodes while the fan-out weights of the remaining nodes go to zero.

We analyze the dynamics of training deep ReLU networks and their implications for generalization capability. Using a teacher-student setting, we discover a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes for deep ReLU networks. With this relationship and the assumption of small overlapping teacher node activations, we prove that (1) student nodes whose weights are initialized close to teacher nodes converge to them at a faster rate, and (2) in the over-parameterized regime and the 2-layer case, while a small set of lucky nodes do converge to the teacher nodes, the fan-out weights of the other nodes converge to zero. This framework provides insight into multiple puzzling phenomena in deep learning, such as over-parameterization, implicit regularization, and lottery tickets. We verify our assumption by showing that the majority of BatchNorm biases of pre-trained VGG11/16 models are negative. Experiments on (1) random deep teacher networks with Gaussian inputs, (2) a teacher network pre-trained on CIFAR-10, and (3) extensive ablation studies validate our multiple theoretical predictions.
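To make the teacher-student setting in the abstract concrete, the following is a minimal NumPy sketch of the 2-layer over-parameterized case with Gaussian inputs. It is an illustration under stated assumptions, not the authors' code: the function names (`train_student`, `node_summary`), the network sizes, the learning rate, and the choice of teacher fan-out weights fixed to 1 are all hypothetical. After training, one can inspect whether student nodes that align with a teacher node keep a large fan-out weight while the others have a small one; the sketch prints these quantities without claiming a particular outcome.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def train_student(n_teacher=3, n_student=12, dim=10, steps=2000,
                  batch=256, lr=0.02, seed=0):
    """Train an over-parameterized 2-layer ReLU student on a fixed teacher.

    All sizes and hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Teacher: unit-norm hidden weights, fan-out weights fixed to 1 (assumption).
    W_t = rng.standard_normal((n_teacher, dim))
    W_t /= np.linalg.norm(W_t, axis=1, keepdims=True)
    v_t = np.ones(n_teacher)

    # Student: more hidden nodes than the teacher (over-parameterized).
    W_s = 0.5 * rng.standard_normal((n_student, dim))
    v_s = 0.1 * rng.standard_normal(n_student)

    losses = []
    for _ in range(steps):
        x = rng.standard_normal((batch, dim))   # Gaussian inputs
        y = relu(x @ W_t.T) @ v_t               # teacher output
        pre = x @ W_s.T                         # student pre-activations
        h = relu(pre)
        err = h @ v_s - y
        losses.append(0.5 * np.mean(err ** 2))

        # Gradients of the mean squared error w.r.t. student parameters.
        grad_v = h.T @ err / batch
        grad_W = ((err[:, None] * v_s[None, :]) * (pre > 0)).T @ x / batch
        v_s -= lr * grad_v
        W_s -= lr * grad_W

    return W_t, W_s, v_s, np.array(losses)


def node_summary(W_t, W_s, v_s):
    """Per student node: best cosine similarity to any teacher node, and |fan-out|."""
    W_t_unit = W_t / np.linalg.norm(W_t, axis=1, keepdims=True)
    W_s_unit = W_s / np.linalg.norm(W_s, axis=1, keepdims=True)
    cos = W_s_unit @ W_t_unit.T
    return cos.max(axis=1), np.abs(v_s)


if __name__ == "__main__":
    W_t, W_s, v_s, losses = train_student()
    best_cos, fan_out = node_summary(W_t, W_s, v_s)
    for j in np.argsort(-fan_out):
        print(f"student {j:2d}: best cosine {best_cos[j]:+.2f}, |fan-out| {fan_out[j]:.3f}")
```

Sorting by fan-out magnitude puts the candidate "lucky" nodes first, so the alignment and fan-out columns can be compared directly. Results depend on the seed and hyperparameters chosen here.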

Code Implementations: 1 repo
