Towards Understanding Learning in Neural Networks with Linear Teachers
This work addresses a foundational problem in deep learning theory: whether neural networks can learn linearly separable data. It provides theoretical guarantees for the optimization and for properties of the learned solution.
This paper proves that SGD globally optimizes a two-layer network with Leaky ReLU activations on linearly separable data, a problem that was previously unsolved. It also provides theoretical support for the empirical observation that the learned network often has an approximately linear decision boundary, showing that this occurs if the network weights converge to two clusters.
Can a neural network minimizing cross-entropy learn linearly separable data? Despite progress in the theory of deep learning, this question remains unsolved. Here we prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations. The learned network can in principle be very complex. However, empirical evidence suggests that it often turns out to be approximately linear. We provide theoretical support for this phenomenon by proving that if network weights converge to two weight clusters, this will imply an approximately linear decision boundary. Finally, we show a condition on the optimization that leads to weight clustering. We provide empirical results that validate our theoretical analysis.
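The link between weight clustering and a linear decision boundary can be illustrated in the idealized case where the two clusters are exact. The sketch below is not taken from the paper: it assumes a network of the form N(x) = v Σ σ(w_i · x) − v Σ σ(u_i · x) with a fixed second layer, and the names `leaky_relu`, `network`, `W`, `U`, `v`, and the slope `alpha = 0.1` are illustrative choices. If every w_i equals a single vector w and every u_i equals a single vector u, then N(x) = k v (σ(w · x) − σ(u · x)), and because Leaky ReLU is strictly increasing, the sign of N(x) matches the sign of (w − u) · x.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # Strictly increasing for alpha > 0; the argument relies on this property.
    return np.where(z > 0, z, alpha * z)

def network(X, W, U, v=1.0, alpha=0.1):
    # Assumed (hypothetical) form: N(x) = v * sum_i s(w_i . x) - v * sum_i s(u_i . x),
    # i.e. a fixed second layer with weight +v on rows of W and -v on rows of U.
    pos = leaky_relu(X @ W.T, alpha).sum(axis=1)
    neg = leaky_relu(X @ U.T, alpha).sum(axis=1)
    return v * (pos - neg)

rng = np.random.default_rng(0)
d, k = 5, 8                      # input dimension and neurons per cluster (illustrative)

# Idealized exact clustering: all positive neurons equal w, all negative neurons equal u.
w = rng.normal(size=d)
u = rng.normal(size=d)
W = np.tile(w, (k, 1))
U = np.tile(u, (k, 1))

X = rng.normal(size=(1000, d))   # random test points

net_pred = np.sign(network(X, W, U))   # prediction of the two-layer network
lin_pred = np.sign(X @ (w - u))        # prediction of the linear classifier w - u

# Fraction of points where the two classifiers agree.
agreement = np.mean(net_pred == lin_pred)
print(agreement)
```

With exact clusters the two classifiers should agree on every point, barring floating-point ties at the boundary. The result stated in the abstract concerns the weaker and more realistic situation in which weights only converge towards two clusters, where the boundary is approximately rather than exactly linear.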