Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training
This work addresses an inefficiency in neural network design by enabling practitioners to use smaller, more effective networks. The contribution is incremental, as it builds on known theoretical properties of ReLU networks.
The authors tackle the problem that randomly initialized ReLU networks rarely achieve the exponential number of linear regions that depth makes possible, which leads to unnecessarily large networks. They introduce a novel parameterization that ensures a depth-$d$ network produces exactly $2^d$ linear regions at initialization and maintains them during training, yielding approximations of convex one-dimensional functions that are several orders of magnitude more accurate than those of randomly initialized networks.
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth-$d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate a preliminary extension of our construction to multidimensional and non-convex functions, allowing the technique to replace traditional dense layers in various architectures.
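To make the exponential growth concrete, the following minimal sketch composes a two-neuron ReLU "tent" layer with itself and counts the linear pieces of the result on $[0, 1]$. This is the standard textbook sawtooth construction, used here only as an illustration; it is not claimed to be the parameterization introduced in this work. The names `tent`, `deep_tent`, and `count_linear_regions` are hypothetical helpers chosen for this example.

```python
def relu(x):
    return max(0.0, x)


def tent(x):
    # One hidden layer with two ReLU units (an illustrative assumption):
    # equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1], mapping [0, 1] onto [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    # Compose the same tent layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_regions(f, n_cells):
    # Sample f on a uniform grid over [0, 1] and count slope changes
    # between adjacent cells. This is exact only if every breakpoint
    # of f lies on a grid point.
    xs = [i / n_cells for i in range(n_cells + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * n_cells for i in range(n_cells)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)


if __name__ == "__main__":
    for depth in range(1, 7):
        # Breakpoints sit at multiples of 1 / 2**depth, so a grid with
        # 2**(depth + 2) cells places every breakpoint on a grid point.
        n_cells = 2 ** (depth + 2)
        regions = count_linear_regions(lambda x: deep_tent(x, depth), n_cells)
        print(f"depth={depth} regions={regions}")
```

Under these assumptions the region count doubles with each added layer, so a depth-$d$ composition has $2^d$ pieces while using only two ReLU units per layer. The grid is dyadic on purpose: all sampled values are exactly representable in floating point, so slopes can be compared for equality without a tolerance.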