ML LG NE OCAug 5, 2019

Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes

arXiv:1908.02419v317.965 citations

Originality Highly original

AI Analysis

This addresses the theoretical understanding of trainability and generalization for deep learning practitioners, providing a more practical over-parameterization requirement compared to prior work.

The paper theoretically proves that gradient descent can find global minima for non-convex optimization in deep neural networks of practical sizes, requiring only a linear increase in parameters with training samples, which is shown to be optimal. It also demonstrates that such networks generalize well to natural datasets but not random ones.

In this paper, we theoretically prove that gradient descent can find a global minimum of non-convex optimization of all layers for nonlinear deep neural networks of sizes commonly encountered in practice. The theory developed in this paper only requires the practical degrees of over-parameterization unlike previous theories. Our theory only requires the number of trainable parameters to increase linearly as the number of training samples increases. This allows the size of the deep neural networks to be consistent with practice and to be several orders of magnitude smaller than that required by the previous theories. Moreover, we prove that the linear increase of the size of the network is the optimal rate and that it cannot be improved, except by a logarithmic factor. Furthermore, deep neural networks with the trainability guarantee are shown to generalize well to unseen test samples with a natural dataset but not a random dataset.

View on arXiv PDF

Similar