LG DIS-NN NE PR

Aug 3, 2023

Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit

Microsoft

arXiv:2308.01814v222.332 citations

h-index: 20

Originality: Incremental advance

AI Analysis

This work provides foundational theoretical insight into adaptive optimization of wide neural networks, a key problem for researchers in neural network theory. It is incremental in that it builds on and generalizes prior results in the Tensor Programs series.

The paper studies wide neural networks trained with adaptive optimizers such as Adam. It shows that the dichotomy between feature learning and kernel behavior, known from SGD, persists for these optimizers, although the "kernel" becomes a nonlinear notion, and it derives the corresponding neural tangent and maximal update limits for any architecture. It also introduces NEXORT, a new Tensor Program language for expressing adaptive optimizer updates, and bra-ket notation to simplify calculations. Together these generalize previous results in the Tensor Programs series.

Going beyond stochastic gradient descent (SGD), what new phenomena emerge in wide neural networks trained by adaptive optimizers like Adam? Here we show: The same dichotomy between feature learning and kernel behaviors (as in SGD) holds for general optimizers as well, including Adam -- albeit with a nonlinear notion of "kernel." We derive the corresponding "neural tangent" and "maximal update" limits for any architecture. Two foundational advances underlie the above results: 1) A new Tensor Program language, NEXORT, that can express how adaptive optimizers process gradients into updates. 2) The introduction of bra-ket notation to drastically simplify expressions and calculations in Tensor Programs. This work summarizes and generalizes all previous results in the Tensor Programs series of papers.
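As a rough illustration of what "process gradients into updates" means for Adam, the sketch below applies the standard Adam rule to a single scalar parameter. It is not taken from the paper: the function names and the toy gradient values are illustrative assumptions. With eps set to 0, rescaling every gradient by a positive constant leaves the Adam updates unchanged, whereas SGD updates scale linearly. This is one simple way to see that Adam's update is a nonlinear function of the gradients.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam updates for one scalar parameter, given its gradient history.

    Hypothetical helper for illustration only; gradients are assumed nonzero
    when eps == 0 so the denominator never vanishes.
    """
    m = 0.0  # running first-moment estimate
    v = 0.0  # running second-moment estimate
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


def sgd_updates(grads, lr=1e-3):
    """Plain SGD updates: linear in the gradients."""
    return [-lr * g for g in grads]


# Toy gradient sequence (made up) and a copy rescaled by 10.
grads = [0.5, -0.2, 0.1]
scaled = [10.0 * g for g in grads]

adam_a = adam_updates(grads, eps=0.0)
adam_b = adam_updates(scaled, eps=0.0)
sgd_a = sgd_updates(grads)
sgd_b = sgd_updates(scaled)

# Adam (eps = 0) ignores the overall gradient scale; SGD does not.
print("Adam, original vs scaled:", adam_a, adam_b)
print("SGD,  original vs scaled:", sgd_a, sgd_b)
```

With a nonzero eps the invariance holds only approximately, which is why eps is set to 0 in this sketch.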
