NE · AI · CL · Dec 19, 2024

Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture

arXiv:2412.15113v2 · 4 citations · h-index: 42 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enhancing ICL efficiency in language models, which is crucial for tasks requiring adaptation to unseen data, though it appears incremental as it builds on existing attention mechanisms.

The authors address in-context learning (ICL) in large language models by introducing a residual stream architecture inspired by associative memory models. The architecture lets information flow directly between attention heads. With it, ICL emerges earlier in training in a two-layer Transformer, and results indicate improved performance in language models with up to 1 billion parameters.

Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million and 1 billion parameters, focusing on attention head values, with results also indicating improved performance at these larger and more naturalistic scales.
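
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The first part shows the standard correspondence between softmax attention and an associative-memory readout: key/value pairs supplied in context are recalled from a noisy query without any weight update. The second part shows one possible reading of "information flowing directly between attention heads": the value vectors of a first-layer head are added to the values of a second-layer head. The function names (`retrieve`, `attention_head`), the `prev_values` argument, the `beta` sharpness parameter, and the choice to add values without any scaling or projection are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve(query, keys, values, beta=8.0):
    """Associative-memory readout: a softmax-weighted sum of the stored values."""
    weights = softmax(beta * (keys @ query))
    return weights @ values

def attention_head(x, w_q, w_k, w_v, prev_values=None):
    """One causal attention head; returns (output, values).

    prev_values is a hypothetical hook: values from an earlier head that are
    added to this head's values (an assumed form of the value residual).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    if prev_values is not None:
        v = v + prev_values
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions after the current token
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v, v

rng = np.random.default_rng(0)

# Part 1: in-context recall of item -> label pairs that exist only in the context.
q_mat, _ = np.linalg.qr(rng.standard_normal((16, 4)))
keys = q_mat.T                       # four orthonormal "items" (rows)
values = np.eye(4)                   # one-hot "labels"
query = keys[2] + 0.05 * rng.standard_normal(16)   # noisy cue for item 2
predicted_label = int(np.argmax(retrieve(query, keys, values)))

# Part 2: two stacked heads, with and without the assumed value residual.
d = 8
x = rng.standard_normal((5, d))      # five token embeddings
w1 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
w2 = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

out1, v1 = attention_head(x, *w1)
h = x + out1                         # ordinary residual stream between layers
out2_plain, _ = attention_head(h, *w2)
out2_resid, _ = attention_head(h, *w2, prev_values=v1)
```

In this sketch the second-layer output changes by exactly the attention-weighted sum of the first-layer values, so the second head can read what the first head stored without routing it through the shared residual stream. Whether the paper scales, projects, or otherwise transforms the values before adding them is not stated in the abstract.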

Code Implementations: 1 repo
