On approximating $\nabla f$ with neural networks
This work addresses a foundational issue in neural network theory for researchers in machine learning, but it appears incremental as it builds on existing results about gradient approximation.
The paper tackles the problem of approximating gradients of smooth functions with neural networks, proving a theorem that networks with more than one hidden layer can only represent one feature in the first layer, which contrasts with known results for single-hidden-layer networks.
Consider a feedforward neural network $ψ: \mathbb{R}^d\rightarrow \mathbb{R}^d$ such that $ψ\approx \nabla f$, where $f:\mathbb{R}^d \rightarrow \mathbb{R}$ is a smooth function, therefore $ψ$ must satisfy $\partial_j ψ_i = \partial_i ψ_j$ pointwise. We prove a theorem that a $ψ$ network with more than one hidden layer can only represent one feature in its first hidden layer; this is a dramatic departure from the well-known results for one hidden layer. The proof of the theorem is straightforward, where two backward paths and a weight-tying matrix play the key roles. We then present the alternative, the implicit parametrization, where the neural network is $φ: \mathbb{R}^d \rightarrow \mathbb{R}$ and $\nabla φ\approx \nabla f$; in addition, a "soft analysis" of $\nabla φ$ gives a dual perspective on the theorem. Throughout, we come back to recent probabilistic models that are formulated as $\nabla φ\approx \nabla f$, and conclude with a critique of denoising autoencoders.