LG · Jan 22, 2025

MLPs at the EOC: Spectrum of the NTK

arXiv:2501.13225v1 · 1 citation · h-index: 3

Originality: Incremental advance

AI Analysis

This paper provides theoretical insight into the training dynamics of deep neural networks and is aimed at researchers in machine learning theory. The advance is incremental, since it builds on the existing NTK and Edge of Chaos frameworks.

The paper analyzes the Neural Tangent Kernel (NTK) of infinitely wide multilayer perceptrons with activations of the form $φ(s) = a s + b \vert s \vert$, initialized at the Edge of Chaos. It finds that the NTK entries are approximated by the inverse cosine distances of the corresponding activations, with the approximation improving as depth increases. The resulting tight spectral bounds show that the parameter $Δ_φ$ determines the rate at which the condition number of the NTK matrix converges to its limit, and that by this measure the absolute value activation outperforms ReLU.

We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$ equipped with activation functions $φ(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and initialized at the Edge Of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1,x_2)$ can be approximated by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$ increasingly better as the depth $l$ increases. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1},x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $Δ_φ= \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($Δ_φ=1$) is better than the ReLU ($Δ_φ=\frac{1}{2}$) in this regard.
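As a small illustration of the quantity $Δ_φ = \frac{b^2}{a^2+b^2}$ from the abstract, the following sketch evaluates it for a few members of the family $φ(s) = a s + b \vert s \vert$. The function names `phi` and `delta_phi`, the coefficient table, and the leaky ReLU entry are illustrative assumptions and do not come from the paper; the coefficients for ReLU and the absolute value follow from writing each activation in the form $a s + b \vert s \vert$.

```python
from fractions import Fraction


def phi(s, a, b):
    """Activation of the family phi(s) = a*s + b*|s|."""
    return a * s + b * abs(s)


def delta_phi(a, b):
    """Delta_phi = b^2 / (a^2 + b^2), computed exactly with rationals."""
    a, b = Fraction(a), Fraction(b)
    denom = a * a + b * b
    if denom == 0:
        raise ValueError("a and b must not both be zero")
    return b * b / denom


half = Fraction(1, 2)
alpha = Fraction(1, 10)  # hypothetical leaky-ReLU negative slope

# (a, b) coefficients; names and the leaky entry are illustrative assumptions.
ACTIVATIONS = {
    "relu": (half, half),                      # s/2 + |s|/2 = max(s, 0)
    "absolute_value": (Fraction(0), Fraction(1)),
    "leaky_relu": ((1 + alpha) / 2, (1 - alpha) / 2),
}

DELTAS = {name: delta_phi(a, b) for name, (a, b) in ACTIVATIONS.items()}

if __name__ == "__main__":
    for name, value in DELTAS.items():
        print(f"{name}: Delta_phi = {value}")
```

The ReLU and absolute value entries reproduce the values $\frac{1}{2}$ and $1$ quoted in the abstract. The sketch only evaluates the parameter and does not compute the NTK or its spectrum.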
