LG · AI · Jul 26, 2024

When narrower is better: the narrow width limit of Bayesian parallel branching neural networks

arXiv:2407.18807v3 · 1 citation · h-index: 82
Originality: Incremental advance
AI Analysis

This work offers a counterintuitive insight for machine learning practitioners working with branching architectures such as graph neural networks and residual networks, though the advance appears incremental, since it extends known width limits to a new regime.

This work challenges the notion that larger network widths always improve generalization. It shows that, in bias-limited scenarios, Bayesian Parallel Branching Neural Networks (BPB-NNs) in the narrow width limit can match or outperform the wide width limit, because symmetry breaking in kernel renormalization lets each branch learn more robustly.

The infinite width limit of random neural networks is known to yield the Neural Network Gaussian Process (NNGP) (Lee et al., 2018), which is characterized by task-independent kernels. It is widely accepted that larger network widths improve generalization (Park et al., 2019). This work challenges that notion by investigating the narrow width limit of the Bayesian Parallel Branching Neural Network (BPB-NN), an architecture that resembles neural networks with residual blocks. We demonstrate that when the width of a BPB-NN is significantly smaller than the number of training examples, each branch learns more robustly because kernel renormalization breaks the symmetry between branches. Surprisingly, in bias-limited scenarios the performance of a BPB-NN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters and instead generally reflect the nature of the data. We demonstrate this phenomenon primarily in branching graph neural networks, where each branch represents a different order of convolution on the graph; we also extend the results to more general architectures such as the residual-MLP and show that the narrow width effect is a general feature of branching networks. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.
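To make the branching structure concrete, here is a minimal sketch of a parallel branching graph network in which branch l sees the node features after l graph convolutions and all branch outputs are summed. It is an illustration under stated assumptions, not the paper's implementation. The linear branches, the symmetric normalized adjacency with self-loops, the 1/sqrt(in_dim * width * num_branches) scaling, and every function name are choices made here for illustration. The weights are drawn from a Gaussian prior only, so the readout norms printed at the end describe random initialization, not the posterior statistics the abstract refers to.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(edges, num_nodes):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph (assumed choice)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def init_params(num_branches, in_dim, width, rng):
    """Draw independent standard Gaussian weights for every branch (prior only)."""
    return [
        {
            "hidden": [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(in_dim)],
            "readout": [rng.gauss(0.0, 1.0) for _ in range(width)],
        }
        for _ in range(num_branches)
    ]


def forward(a_hat, x, params):
    """Sum the scalar node outputs of all branches; also return each branch's share."""
    in_dim, width, num_branches = len(x[0]), len(params[0]["readout"]), len(params)
    scale = 1.0 / math.sqrt(in_dim * width * num_branches)  # assumed normalization
    propagated = x  # branch 0 sees the raw features, A^0 X = X
    per_branch = []
    for branch in params:
        hidden = matmul(propagated, branch["hidden"])  # shape: (nodes, width)
        out = [scale * sum(h * a for h, a in zip(row, branch["readout"])) for row in hidden]
        per_branch.append(out)
        propagated = matmul(a_hat, propagated)  # next branch sees one more convolution
    total = [sum(vals) for vals in zip(*per_branch)]
    return total, per_branch


def readout_norms(params):
    """Squared readout norm of each branch, divided by the width."""
    return [sum(a * a for a in b["readout"]) / len(b["readout"]) for b in params]


# Toy example: a 3-node path graph with 2 input features per node.
rng = random.Random(0)
a_hat = normalized_adjacency([(0, 1), (1, 2)], 3)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
params = init_params(num_branches=3, in_dim=2, width=4, rng=rng)
total, per_branch = forward(a_hat, x, params)
print("node outputs:", total)
print("readout norms per branch:", readout_norms(params))
```

In this sketch the width (4) plays the role of the quantity that the narrow width limit keeps small relative to the number of training examples, while the number of branches sets how many convolution orders contribute to the output.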
