Representation Benefits of Deep Feedforward Networks
This work addresses a foundational theoretical problem in machine learning by providing a concrete example that demonstrates the superior representational power of deep networks over shallow ones, which is incremental as it builds on existing theoretical frameworks.
The paper tackles the problem of understanding the representational advantages of deep networks over shallow ones by constructing a family of classification problems where shallow networks require exponentially many nodes to achieve low error, while deep networks with a small constant number of nodes per layer achieve zero error. The result shows that deep networks can solve these problems with zero error using only 2 nodes per layer over 2k layers, compared to shallow networks needing exponentially many nodes to avoid at least 1/6 error.
This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.