LG · Nov 17, 2022

On the Sample Complexity of Two-Layer Networks: Lipschitz vs. Element-Wise Lipschitz Activation

arXiv:2211.09634v43.32 citations · h-index: 23

Originality: Incremental advance

AI Analysis

This work addresses theoretical foundations for neural network generalization, providing insights into activation function design, but it is incremental as it builds on existing norm-based bounds.

The paper analyzes how the activation function affects the sample complexity of bounded two-layer neural networks. It shows that element-wise Lipschitz activations yield only a logarithmic dependence on width, whereas some non-element-wise Lipschitz activations force a linear dependence.

We investigate the sample complexity of bounded two-layer neural networks using different activation functions. In particular, we consider the class $$ \mathcal{H} = \left\{\textbf{x}\mapsto \langle \textbf{v}, \sigma\circ W\textbf{x} + \textbf{b} \rangle : \textbf{b}\in\mathbb{R}^d, W \in \mathbb{R}^{\mathcal{T}\times d}, \textbf{v} \in \mathbb{R}^{\mathcal{T}}\right\} $$ where the spectral norms of $W$ and $\textbf{v}$ are bounded by $O(1)$, the Frobenius distance of $W$ from its initialization is bounded by $R > 0$, and $\sigma$ is a Lipschitz activation function. We prove that if $\sigma$ is element-wise, then the sample complexity of $\mathcal{H}$ has only a logarithmic dependence on width, and that this complexity is tight up to logarithmic factors. We further show that the element-wise property of $\sigma$ is essential for a bound with logarithmic dependence on width, in the sense that there exist non-element-wise activation functions whose sample complexity is linear in width, for widths that can be up to exponential in the input dimension. For the upper bound, we use the recent approach to norm-based bounds called Approximate Description Length (ADL), introduced in arXiv:1910.05697. We further develop new techniques and tools for this approach that will hopefully inspire future work.
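To make the distinction between element-wise and non-element-wise Lipschitz activations concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the paper. All function names are hypothetical. The bias is assumed to have one entry per hidden unit so that the sum inside the inner product is well defined. The unit-ball projection is just one convenient example of a 1-Lipschitz map that is not element-wise; it is not claimed to be the construction behind the paper's lower bound.

```python
import math

def relu(z):
    """Element-wise 1-Lipschitz activation: output i depends only on z[i]."""
    return [max(0.0, zi) for zi in z]

def project_to_unit_ball(z):
    """Non-element-wise 1-Lipschitz activation: Euclidean projection onto the unit ball."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    scale = 1.0 if norm <= 1.0 else 1.0 / norm
    return [scale * zi for zi in z]

def predict(x, W, v, b, sigma):
    """Evaluate <v, sigma(W x) + b> for one input x (b has one entry per hidden unit: an assumption)."""
    pre = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    hidden = [h + bi for h, bi in zip(sigma(pre), b)]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def frobenius_distance(W, W0):
    """Frobenius norm of W - W0, the quantity bounded by R in the class definition."""
    return math.sqrt(sum((a - c) ** 2
                         for row_a, row_c in zip(W, W0)
                         for a, c in zip(row_a, row_c)))

def is_elementwise(sigma, z, i=0, j=1, delta=0.5):
    """Heuristic probe: perturbing z[j] should leave output coordinate i unchanged."""
    z_shifted = list(z)
    z_shifted[j] += delta
    return sigma(z)[i] == sigma(z_shifted)[i]

# Small usage example with a width-3 hidden layer and 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0]
x = [1.0, -1.0]
print(predict(x, W, v, b, relu))
print(is_elementwise(relu, [2.0, 1.0, -1.0]))
print(is_elementwise(project_to_unit_ball, [2.0, 1.0, -1.0]))
```

The probe `is_elementwise` tests only a single pair of coordinates at a single point, so it can show that an activation is not element-wise but cannot certify that it is. Both activations are 1-Lipschitz in the Euclidean norm, so under the abstract's results the difference in how sample complexity depends on width comes from the coordinate structure rather than from the Lipschitz constant.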
