MLPs at the EOC: Concentration of the NTK
The result provides theoretical guarantees for NTK concentration in finite-width neural networks, an incremental but important step toward understanding optimization and generalization in deep learning.
The paper proves concentration of the Neural Tangent Kernel (NTK) for finite-width multilayer perceptrons (MLPs) initialized at the Edge of Chaos (EOC), without relying on asymptotic assumptions. It shows that hidden layer widths must grow quadratically in the layer index for the NTK to accurately approximate its infinite-width limit, and that the absolute value activation yields a tighter concentration bound than ReLU.
We study the concentration of the Neural Tangent Kernel (NTK) $K_θ: \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times Θ \to \mathbb{R}^{m_l}$ equipped with activation functions $φ(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$, with the parameter $θ \in Θ$ initialized at the Edge of Chaos (EOC). Without relying on the gradient independence assumption, which has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. By showing via maximal inequalities that the NTK entries $K_θ(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously, we prove that the NTK matrix $K(θ) = [\frac{1}{n} K_θ(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization. Our results imply that, in order to accurately approximate the limit, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$. For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(θ) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((Δ_φ^{-2} + m_l^{\frac{1}{2}} l) κ_φ^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where we denote $Δ_φ = \frac{b^2}{a^2+b^2}$ and $κ_φ = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($Δ_φ=1$, $κ_φ=1$) beats the ReLU ($Δ_φ=\frac{1}{2}$, $κ_φ=\sqrt{2}$) in terms of the concentration of the NTK.
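As a rough illustration of the constants above, the following Python sketch evaluates $Δ_φ$ and $κ_φ$ for a given pair $(a, b)$, lists the quadratically growing widths $m_k = k^2 m$, and computes the leading factor $(Δ_φ^{-2} + m_l^{\frac{1}{2}} l) κ_φ^2 m^{-\frac{1}{2}}$ of the bound. The function names are hypothetical, the indexing of hidden layers as $k = 1, \dots, l-1$ is an assumption, and the constants and logarithmic terms hidden in the $O(\cdot)$ are dropped, so the resulting number is only useful for comparing activations, not as an actual bound. ReLU is written as $φ(s) = \frac{1}{2} s + \frac{1}{2} \vert s \vert$, that is, $a = b = \frac{1}{2}$.

```python
import math


def activation_constants(a, b):
    """Return (Delta_phi, kappa_phi) for phi(s) = a*s + b*|s|."""
    norm_sq = a * a + b * b
    if norm_sq == 0:
        raise ValueError("a and b must not both be zero")
    delta = b * b / norm_sq                       # Delta_phi = b^2 / (a^2 + b^2)
    kappa = (abs(a) + abs(b)) / math.sqrt(norm_sq)  # kappa_phi
    return delta, kappa


def hidden_widths(m, l):
    """Widths m_k = k^2 * m, assuming hidden layers are k = 1, ..., l-1."""
    return [k * k * m for k in range(1, l)]


def bound_rate(a, b, m, l, m_l):
    """Leading factor of the bound, with O(.) constants and log terms dropped."""
    delta, kappa = activation_constants(a, b)
    if delta == 0:
        # Purely linear activation (b = 0): Delta_phi^{-2} is infinite.
        return math.inf
    return (delta ** -2 + math.sqrt(m_l) * l) * kappa ** 2 / math.sqrt(m)


if __name__ == "__main__":
    # Illustrative sizes only; these values do not come from the paper.
    m, l, m_l = 100, 3, 1
    print("hidden widths:", hidden_widths(m, l))
    for name, (a, b) in {"abs": (0.0, 1.0), "relu": (0.5, 0.5)}.items():
        delta, kappa = activation_constants(a, b)
        rate = bound_rate(a, b, m, l, m_l)
        print(f"{name}: Delta={delta:.3f} kappa={kappa:.3f} rate={rate:.3f}")
```

For any fixed $m$, $l$ and $m_l$, the absolute value gives a smaller leading factor than ReLU in this sketch, because both $Δ_φ^{-2}$ and $κ_φ^2$ are smaller.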