On Approximation Capabilities of ReLU Activation and Softmax Output Layer in Neural Networks
To the best of the authors' knowledge, this work provides the first theoretical justification for using softmax output layers in neural networks for pattern classification, addressing a foundational gap in machine learning theory.
The paper extends universal approximator theory to neural networks with ReLU activation and softmax output layers, proving that sufficiently large networks of this kind can approximate any function in $L^1$ to arbitrary precision, as well as any indicator function in $L^1$ of the kind that encodes class labels in multi-class classification.
In this paper, we have extended the well-established universal approximator theory to neural networks that use the unbounded ReLU activation function and a nonlinear softmax output layer. We have proved that a sufficiently large neural network using the ReLU activation function can approximate any function in $L^1$ to arbitrary precision. Moreover, our theoretical results show that a sufficiently large neural network using a nonlinear softmax output layer can also approximate any indicator function in $L^1$, which is equivalent to the mutually exclusive class labels in any realistic multi-class pattern classification problem. To the best of our knowledge, this work is the first theoretical justification for using softmax output layers in neural networks for pattern classification.
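As a small numerical illustration of the two claims above (not the construction used in the proofs), the following sketch assumes a one-dimensional input and uses NumPy. The helper names `relu_bump`, `l1_error`, and `sharpened_softmax` are hypothetical and chosen only for this example. The first part builds a trapezoid from four ReLU units and measures its $L^1$ distance to the indicator function of $[a, b]$; analytically that distance is the area of the two side triangles, which is $\delta$, so it shrinks as the ramps get steeper. The second part shows a softmax output moving toward a one-hot (indicator-like) vector as its logits are scaled up.

```python
import numpy as np


def relu(x):
    """Elementwise ReLU: max(x, 0)."""
    return np.maximum(x, 0.0)


def relu_bump(x, a, b, delta):
    """Trapezoid built from four ReLU units (illustrative, hypothetical helper).

    Equals 1 on [a, b], 0 outside [a - delta, b + delta],
    and is linear on the two ramps of width delta.
    """
    return (relu(x - a + delta) - relu(x - a)
            - relu(x - b) + relu(x - b - delta)) / delta


def l1_error(delta, a=0.0, b=1.0, lo=-1.0, hi=2.0, n=300_001):
    """Riemann-sum estimate of the L^1 distance between the trapezoid
    and the indicator function of [a, b] over the interval [lo, hi]."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    indicator = ((x >= a) & (x <= b)).astype(float)
    return float(np.sum(np.abs(relu_bump(x, a, b, delta) - indicator)) * dx)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def sharpened_softmax(logits, scale):
    """Softmax applied to logits multiplied by `scale`; larger scales push
    the output toward the one-hot vector of the largest logit."""
    return softmax(scale * np.asarray(logits, dtype=float))


if __name__ == "__main__":
    # L^1 error of the ReLU trapezoid shrinks with the ramp width delta.
    for delta in (0.5, 0.1, 0.01):
        print(f"delta={delta}: L1 error ~ {l1_error(delta):.4f}")

    # Softmax output approaches a one-hot class label as the scale grows.
    for scale in (1, 10, 100):
        print(f"scale={scale}: {sharpened_softmax([1.0, 0.5, -0.3], scale)}")
```

Note that the indicator function is discontinuous, so no continuous network output can match it uniformly; the approximation here is only in the $L^1$ sense, which is the setting of the results summarized above.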