The Hamilton-Jacobi Theory of Deep Learning

arXiv:2605.28983 · 49.5 · h-index: 13
AI Analysis

Provides a unified theoretical framework connecting neural networks, tropical algebra, viscous PDEs, and convex optimization, offering quantitative insights into generalization, robustness, and interpretability.

This paper establishes an exact correspondence between training neural networks and solving Hamilton-Jacobi equations, showing that each gradient step selects the initial data of a viscous Hamilton-Jacobi equation. Key results include the minimax-optimal generalization rate O(n^{-1/(d+2)}), adversarial robustness controlled by a deformation parameter ε, and a closed-form O(N) influence function whose entropy landscape undergoes fold bifurcations.

This paper identifies training a neural network, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations. At inference, the input is the spatial point at which that solution is evaluated, and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax-optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each of which merges attribution basins.
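The role of $\varepsilon$ in a log-sum-exp layer can be illustrated with a minimal sketch. It assumes the standard scaled form $\varepsilon \log \sum_j \exp(a_j/\varepsilon)$ and takes the attribution weights $\pi_j$ to be the ordinary softmax at temperature $\varepsilon$. The abstract does not spell out the paper's exact definitions or sign conventions, so the function names `lse`, `softmax_weights`, and `entropy` are hypothetical, and the code shows only two well-known facts: the layer tends to the tropical (max-plus) value $\max_j a_j$ as $\varepsilon \to 0$, and the entropy of the softmax weights grows with $\varepsilon$.

```python
import math

def lse(a, eps):
    """Scaled log-sum-exp: eps * log(sum_j exp(a_j / eps)), computed stably.

    Assumption: this is the usual 'deformed max'; the paper's own
    normalization may differ.
    """
    m = max(a)
    return m + eps * math.log(sum(math.exp((x - m) / eps) for x in a))

def softmax_weights(a, eps):
    """Softmax at temperature eps; equals the gradient of lse(a, eps) in a.

    Assumption: stands in for the attribution weights pi_j of the abstract.
    Cost is O(N) in the number of terms N.
    """
    m = max(a)
    z = [math.exp((x - m) / eps) for x in a]
    s = sum(z)
    return [v / s for v in z]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]  # illustrative pre-activations, not from the paper
    for eps in (1e-3, 0.1, 1.0, 10.0):
        pi = softmax_weights(a, eps)
        print(eps, lse(a, eps), entropy(pi))
```

For any $\varepsilon > 0$ the value satisfies $\max_j a_j \le \mathrm{lse}(a, \varepsilon) \le \max_j a_j + \varepsilon \log N$, so small $\varepsilon$ recovers the max-plus limit, while large $\varepsilon$ spreads the weights toward uniform and raises their entropy. The sketch does not attempt to reproduce the fold bifurcations described in the abstract.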

