CV
Nov 8, 2016

The Loss Surface of Residual Networks: Ensembles and the Role of Batch Normalization

arXiv:1611.02525v1
15 citations
Originality: Incremental advance
AI Analysis

This work provides insights into the training dynamics of residual networks, which are widely used in deep learning for improved performance and trainability at extreme depths.

The paper investigates the dynamic ensemble behavior of deep residual networks. It shows that the virtual ensemble shifts from shallower to deeper depths during training, driven primarily by scaling mechanisms such as batch normalization, and it uses spin glass models to analyze critical points in the optimization landscape.

Deep Residual Networks present a premium in performance in comparison to conventional networks of the same depth and are trainable at extreme depths. It has recently been shown that Residual Networks behave like ensembles of relatively shallow networks. We show that these ensembles are dynamic: while initially the virtual ensemble is mostly at depths lower than half the network's depth, as training progresses, it becomes deeper and deeper. The main mechanism that controls the dynamic ensemble behavior is the scaling introduced, e.g., by the Batch Normalization technique. We explain this behavior and demonstrate the driving force behind it. As a main tool in our analysis, we employ generalized spin glass models, which we also use in order to study the number of critical points in the optimization of Residual Networks.
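The shift of the virtual ensemble toward deeper paths can be illustrated with a toy calculation. The sketch below is not the paper's spin glass model. It assumes, purely for illustration, that a residual network with n blocks contains C(n, k) paths of depth k, and that a scalar scale applied to every residual branch weights each depth-k path by scale**k. The function names `path_depth_weights` and `mean_depth` are hypothetical.

```python
from math import comb


def path_depth_weights(n_blocks, scale):
    """Normalized weight of each path depth k = 0..n_blocks.

    Assumption: there are C(n_blocks, k) paths of depth k, and each
    such path is weighted by scale**k (one factor per residual branch).
    """
    raw = [comb(n_blocks, k) * scale ** k for k in range(n_blocks + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def mean_depth(weights):
    """Expected path depth under the normalized weights."""
    return sum(k * w for k, w in enumerate(weights))


if __name__ == "__main__":
    n = 20
    for scale in (0.5, 1.0, 2.0):
        weights = path_depth_weights(n, scale)
        print(f"scale={scale}: mean path depth = {mean_depth(weights):.2f}")
```

Under these assumptions the weights form a binomial distribution with mean n * scale / (1 + scale). A scale below 1 concentrates the ensemble at depths under half the network's depth, and a growing scale moves it deeper, which is the qualitative behavior the abstract describes.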

