Diego González-Sánchez

LG
h-index3
4papers
15citations
Novelty61%
AI Score30

4 Papers

LGMay 26, 2022
A Framework for Overparameterized Learning

Dávid Terjék, Diego González-Sánchez

A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.

LGJan 24, 2025
MLPs at the EOC: Concentration of the NTK

Dávid Terjék, Diego González-Sánchez

We study the concentration of the Neural Tangent Kernel (NTK) $K_θ: \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times Θ\to \mathbb{R}^{m_l}$ equipped with activation functions $φ(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ with the parameter $θ\in Θ$ being initialized at the Edge Of Chaos (EOC). Without relying on the gradient independence assumption that has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. Showing that the NTK entries $K_θ(x_{i_1},x_{i_2})$ for $i_1,i_2 \in [1:n]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously via maximal inequalities, we prove that the NTK matrix $K(θ) = [\frac{1}{n} K_θ(x_{i_1},x_{i_2}) : i_1,i_2 \in [1:n]] \in \mathbb{R}^{nm_l \times nm_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{nm_l \times nm_l}$ without the need for linear overparameterization. Our results imply that in order to accurately approximate the limit, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$ for sufficient concentration. For such MLPs, we obtain the concentration bound $\mathbb{P}( \Vert K(θ) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((Δ_φ^{-2} + m_l^{\frac{1}{2}} l) κ_φ^2 m^{-\frac{1}{2}})) \geq 1-O(m^{-1})$ modulo logarithmic terms, where we denoted $Δ_φ= \frac{b^2}{a^2+b^2}$ and $κ_φ= \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($Δ_φ=1$, $κ_φ=1$) beats the ReLU ($Δ_φ=\frac{1}{2}$, $κ_φ=\sqrt{2}$) in terms of the concentration of the NTK.

LGJan 22, 2025
MLPs at the EOC: Spectrum of the NTK

Dávid Terjék, Diego González-Sánchez

We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$ equipped with activation functions $φ(s) = a s + b \vert s \vert$ for some $a,b \in \mathbb{R}$ and initialized at the Edge Of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1,x_2)$ can be approximated by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$ increasingly better as the depth $l$ increases. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1},x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1,\cdots,x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $Δ_φ= \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($Δ_φ=1$) is better than the ReLU ($Δ_φ=\frac{1}{2}$) in this regard.

OCMay 29, 2021
Optimal transport with $f$-divergence regularization and generalized Sinkhorn algorithm

Dávid Terjék, Diego González-Sánchez

Entropic regularization provides a generalization of the original optimal transport problem. It introduces a penalty term defined by the Kullback-Leibler divergence, making the problem more tractable via the celebrated Sinkhorn algorithm. Replacing the Kullback-Leibler divergence with a general $f$-divergence leads to a natural generalization. The case of divergences defined by superlinear functions was recently studied by Di Marino and Gerolin. Using convex analysis, we extend the theory developed so far to include all $f$-divergences defined by functions of Legendre type, and prove that under some mild conditions, strong duality holds, optimums in both the primal and dual problems are attained, the generalization of the $c$-transform is well-defined, and we give sufficient conditions for the generalized Sinkhorn algorithm to converge to an optimal solution. We propose a practical algorithm for computing an approximate solution of the optimal transport problem with $f$-divergence regularization via the generalized Sinkhorn algorithm. Finally, we present experimental results on synthetic 2-dimensional data, demonstrating the effects of using different $f$-divergences for regularization, which influences convergence speed, numerical stability and sparsity of the optimal coupling.