LG · ML · Apr 15, 2019

The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

arXiv:1904.06963v5 · 122 citations
Originality: Incremental advance
AI Analysis

This paper addresses inefficient training in deep learning and offers researchers and practitioners insight into architectural choices. Its contribution is incremental, building on existing optimization theory.

The paper examines how neural network architecture affects training speed. It introduces gradient confusion and shows that, for popular initialization techniques, increasing width reduces gradient confusion and speeds up training, while increasing depth has the opposite effect. Alternate initialization techniques, or networks that combine batch normalization with skip connections, ease the training burden of very deep networks.

This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.
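To make the idea of negatively correlated stochastic gradients concrete, here is a minimal sketch in plain Python. It assumes one common formalization, which this summary does not spell out: gradient confusion is the smallest eta >= 0 such that the inner product of every pair of per-sample gradients is at least -eta. The helper names and the toy least-squares loss are hypothetical, chosen only for illustration, and are not taken from the paper's code.

```python
import itertools


def per_sample_gradients(w, xs, ys):
    """Gradients of the toy loss f_i(w) = 0.5 * (x_i . w - y_i)^2, one per sample.

    The least-squares loss is an illustrative assumption, not the paper's setup.
    """
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([residual * xi for xi in x])
    return grads


def min_pairwise_inner_product(grads):
    """Smallest inner product <g_i, g_j> over all pairs with i != j."""
    return min(
        sum(a * b for a, b in zip(gi, gj))
        for gi, gj in itertools.combinations(grads, 2)
    )


def gradient_confusion(grads):
    """Smallest eta >= 0 such that <g_i, g_j> >= -eta for all i != j."""
    return max(0.0, -min_pairwise_inner_product(grads))


# Two hand-made gradient sets: one where samples agree, one where they conflict.
aligned = [[1.0, 0.0], [2.0, 1.0]]       # inner product is 2.0
conflicting = [[1.0, 0.0], [-1.0, 0.5]]  # inner product is -1.0

print(gradient_confusion(aligned))      # 0.0
print(gradient_confusion(conflicting))  # 1.0
```

In the conflicting case, a step that lowers the loss on one sample raises it on the other, which is the slowdown the abstract describes. For a real network, the per-sample gradients would come from an automatic differentiation library rather than a closed-form expression.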

