Lek-Heng Lim, Mateusz Michalek, Yang Qi
We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set $s_1, \dots, s_n \in \mathbb{R}^p$ with corresponding responses $t_1,\dots,t_n \in \mathbb{R}^q$, fitting a $k$-layer neural network $ν_θ: \mathbb{R}^p \to \mathbb{R}^q$ involves estimation of the weights $θ\in \mathbb{R}^m$ via an ERM: \[ \inf_{θ\in \mathbb{R}^m} \; \sum_{i=1}^n \lVert t_i - ν_θ(s_i) \rVert_2^2. \] We show that even for $k = 2$, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. A high-level explanation is like that for the nonexistence of best rank-$r$ approximations of higher-order tensors --- the set of parameters is not a closed set --- but the geometry involved for best $k$-layer neural networks approximations is more subtle. In addition, we show that for smooth activations $σ(x)= 1/\bigl(1 + \exp(-x)\bigr)$ and $σ(x)=\tanh(x)$, such failure to attain an infimum can happen on a positive-measured subset of responses. For the ReLU activation $σ(x)=\max(0,x)$, we completely classifying cases where the ERM for a best two-layer neural network approximation attains its infimum. As an aside, we obtain a precise description of the geometry of the space of two-layer neural networks with $d$ neurons in the hidden layer: it is the join locus of a line and the $d$-secant locus of a cone.