LGMay 3

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

arXiv:2605.0170225.8

AI Analysis

For practitioners and theorists, this provides a theoretical foundation that floating-point networks can approximate both function values and gradients, bridging the gap between ideal real-number theory and real-world implementations.

This work proves that floating-point neural networks with automatic differentiation can represent almost all floating-point functions and their gradients, extending universal approximation to practical finite-precision settings. The results hold for common activation functions like ReLU, ELU, GeLU, Swish, Sigmoid, and tanh.

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $ϕ$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(ϕ\circ f)$, respectively. We further extend this result: given $ϕ_1,\dots,ϕ_n$, $D^\mathtt{AD}(ϕ_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.

View on arXiv PDF

Similar