LGCVNEMLApr 26, 2019

On Exact Computation with an Infinitely Wide Neural Net

arXiv:1904.11955v21041 citations
Originality Incremental advance
AI Analysis

This work provides a significant benchmark for kernel-based methods in deep learning, advancing theoretical understanding of infinite-width neural nets, though it is incremental as it builds on existing NTK concepts.

The paper tackles the problem of theoretically understanding deep learning by extending the Neural Tangent Kernel (NTK) to convolutional neural nets, introducing Convolutional NTK (CNTK) with an efficient exact algorithm and GPU implementation. This results in a kernel-based method achieving a 10% higher performance on CIFAR-10 than prior methods and only 6% lower than finite deep nets.

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its width --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width. The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being $10\%$ higher than the methods reported in [Novak et al., 2019], and only $6\%$ lower than the performance of the corresponding finite deep net architecture (once batch normalization, etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes