SGD Learns the Conjugate Kernel Class of the Network
This addresses the theoretical understanding of SGD's efficiency in training deep neural networks, providing polynomial-time guarantees for learning in rich architectures, which is incremental but important for bridging theory and practice in machine learning.
The paper proves that stochastic gradient descent (SGD) can learn a function competitive with the best in the conjugate kernel space of neural networks in polynomial time for log-depth architectures, marking the first such guarantee for networks deeper than two layers. As corollaries, it shows SGD learns constant-degree polynomials with bounded coefficients in polynomial time and any continuous function given sufficient network size, though not necessarily in polynomial time.
We show that the standard stochastic gradient decent (SGD) algorithm is guaranteed to learn, in polynomial time, a function that is competitive with the best function in the conjugate kernel space of the network, as defined in Daniely, Frostig and Singer. The result holds for log-depth networks from a rich family of architectures. To the best of our knowledge, it is the first polynomial-time guarantee for the standard neural network learning algorithm for networks of depth more that two. As corollaries, it follows that for neural networks of any depth between $2$ and $\log(n)$, SGD is guaranteed to learn, in polynomial time, constant degree polynomials with polynomially bounded coefficients. Likewise, it follows that SGD on large enough networks can learn any continuous function (not in polynomial time), complementing classical expressivity results.