LG · Jun 9, 2024

Attention as a Hypernetwork

arXiv:2406.05816v4 · 13 citations
Originality: Incremental advance
AI Analysis

This paper addresses how transformers achieve compositional generalization and how that ability can be strengthened, a question of interest to AI researchers. The advance is incremental, since it builds on existing attention mechanisms.

The study reformulates multi-head attention as a hypernetwork to investigate how transformers achieve compositional generalization. The reformulation reveals a low-dimensional latent code that is predictive of the subtasks the network performs on unseen task compositions. Making the hypernetwork-generated value network nonlinear improves compositional generalization on abstract reasoning tasks. Experiments on a symbolic version of the Raven's Progressive Matrices test show that scaling model size and data enables compositional generalization.

Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in transformers and gives rise to a functionally structured latent space.
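The reformulation described in the abstract can be illustrated with a small numerical sketch. It assumes standard softmax multi-head attention with no biases, masking, or positional terms, and every name in it (`init_params`, `attention_codes`, `mha_standard`, `mha_hypernetwork`) is hypothetical and not taken from the paper's code. The idea shown: for each query-key pair, the attention scores across heads form a vector of length H (the latent code), and that vector linearly mixes per-head value-output matrices into a linear map specific to that query-key pair. The paper's exact normalization and its nonlinear variant may differ from this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d_model, n_heads, d_head):
    # Hypothetical parameter layout: one projection per head.
    proj = (n_heads, d_model, d_head)
    return {
        "Wq": rng.normal(size=proj) / np.sqrt(d_model),
        "Wk": rng.normal(size=proj) / np.sqrt(d_model),
        "Wv": rng.normal(size=proj) / np.sqrt(d_model),
        "Wo": rng.normal(size=(n_heads, d_head, d_model)) / np.sqrt(d_head),
    }

def attention_codes(x, p):
    # x: (T, d_model). Returns a with shape (H, T, T); a[:, q, k] is the
    # H-dimensional latent code for the query-key pair (q, k).
    q = np.einsum("td,hde->hte", x, p["Wq"])
    k = np.einsum("td,hde->hte", x, p["Wk"])
    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)

def mha_standard(x, p):
    # Usual formulation: per-head weighted sum of values, then output projection.
    a = attention_codes(x, p)
    v = np.einsum("td,hde->hte", x, p["Wv"])
    heads = np.einsum("hqk,hke->hqe", a, v)
    return np.einsum("hqe,hed->qd", heads, p["Wo"])

def mha_hypernetwork(x, p):
    # Hypernetwork view: the latent code a[:, q, k] linearly combines
    # per-head matrices Wv[h] @ Wo[h] into a map W_qk applied to x[k].
    a = attention_codes(x, p)
    w_head = np.einsum("hde,hef->hdf", p["Wv"], p["Wo"])  # (H, d, d)
    w_qk = np.einsum("hqk,hdf->qkdf", a, w_head)          # (T, T, d, d)
    return np.einsum("kd,qkdf->qf", x, w_qk)              # sum over keys
```

Both functions compute the same output up to floating-point error, because the sums over heads and keys can be reordered. The hypernetwork form materializes a `(T, T, d_model, d_model)` tensor, so it is useful for inspecting the latent codes on small inputs rather than as an efficient implementation.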

Code Implementations: 1 repo
