LG | Jun 9, 2024

Attention as a Hypernetwork

Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

arXiv:2406.05816v4 | 15.013 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of understanding and improving compositional generalization in transformers, a question of interest to AI researchers. The advance is incremental because it builds on existing attention mechanisms.

The study investigates how transformers achieve compositional generalization by reformulating multi-head attention as a hypernetwork. It finds that a low-dimensional latent code is predictive of the subtasks the network performs on unseen task compositions, and that making the hypernetwork-generated value network nonlinear improves compositional generalization on abstract reasoning tasks. Experiments on a symbolic version of the Raven's Progressive Matrices test show that scaling model size and data enables compositional generalization.

Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in transformers and gives rise to a functionally structured latent space.
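The hypernetwork reading in the abstract can be illustrated with a small NumPy sketch. It assumes that the latent code for a query-key pair is the vector of attention scores across heads, and that this code linearly mixes per-head matrices into a key-query specific linear value network. The abstract does not spell out this construction, so it is one plausible reading, not the authors' implementation. All function and parameter names (`make_params`, `latent_code`, `mha_hypernetwork`, `W_q`, `W_k`, `W_v`, `W_o`) are hypothetical. The sketch checks that this view yields the same output as ordinary multi-head attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_params(rng, n_heads, d_model, d_head):
    # Hypothetical parameter layout: one projection matrix per head.
    def proj_in():
        return rng.normal(size=(n_heads, d_head, d_model)) / np.sqrt(d_model)

    return {
        "W_q": proj_in(),
        "W_k": proj_in(),
        "W_v": proj_in(),
        "W_o": rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_head),
    }


def latent_code(x, p):
    # Attention scores arranged as (query, key, head). Under the assumption
    # stated above, the vector over heads is the latent code of a (q, k) pair.
    q = np.einsum("hdm,tm->htd", p["W_q"], x)
    k = np.einsum("hdm,tm->htd", p["W_k"], x)
    logits = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(q.shape[-1])
    return softmax(logits, axis=-1).transpose(1, 2, 0)


def mha_standard(x, p):
    # Ordinary multi-head attention: per-head weighted sum of values,
    # followed by the output projection summed over heads.
    a = latent_code(x, p)                               # (T_q, T_k, H)
    v = np.einsum("hdm,tm->htd", p["W_v"], x)           # (H, T_k, d_head)
    heads = np.einsum("qkh,hkd->hqd", a, v)             # (H, T_q, d_head)
    return np.einsum("hmd,hqd->qm", p["W_o"], heads)    # (T_q, d_model)


def mha_hypernetwork(x, p):
    # Same computation, regrouped: the latent code mixes per-head matrices
    # W_o[h] @ W_v[h] into one linear map per (query, key) pair.
    a = latent_code(x, p)                                       # (T_q, T_k, H)
    head_maps = np.einsum("hmd,hdn->hmn", p["W_o"], p["W_v"])   # (H, d, d)
    w_qk = np.einsum("qkh,hmn->qkmn", a, head_maps)             # (T_q, T_k, d, d)
    return np.einsum("qkmn,kn->qm", w_qk, x)                    # (T_q, d_model)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))                 # 5 tokens, d_model = 8
    params = make_params(rng, n_heads=2, d_model=8, d_head=4)
    diff = np.abs(mha_standard(x, params) - mha_hypernetwork(x, params)).max()
    print("max abs difference:", diff)
```

The two functions differ only in the order of summation, so they agree up to floating-point error. The hypernetwork form materializes one `d_model × d_model` matrix per query-key pair, which is useful for inspection but far more memory-hungry than the standard form.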
