ML, CL, LG · Jun 1, 2023

Birth of a Transformer: A Memory Viewpoint

arXiv:2306.00802v2 · 168 citations · h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses the need to understand transformer internal mechanisms for improved reliability, focusing on a synthetic memory problem.

The paper investigates how transformers balance stored knowledge from training data with adaptation to new context, using a synthetic setup with global and context-specific bigram distributions. Through empirical analysis of a two-layer transformer, it shows fast learning of global bigrams and slower development of an induction head mechanism for in-context bigrams, with theoretical insights on gradient-based learning.
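The synthetic setup described above can be illustrated with a small data generator. This is a minimal sketch under stated assumptions, not the paper's exact procedure: the function name `sample_sequence`, the vocabulary size, the choice of trigger tokens, and the Dirichlet-sampled global bigram table are all hypothetical. Most tokens follow a global bigram distribution shared by all sequences, while the token after a "trigger" is fixed per sequence and so can only be predicted from context.

```python
import numpy as np

def sample_sequence(global_bigram, triggers, seq_len, rng):
    """Sample one sequence mixing global and in-context bigrams (illustrative)."""
    vocab = global_bigram.shape[0]
    # Context-specific bigrams: each trigger token gets an output token
    # that is redrawn for every new sequence.
    outputs = {int(q): int(rng.integers(vocab)) for q in triggers}
    seq = [int(rng.integers(vocab))]
    while len(seq) < seq_len:
        prev = seq[-1]
        if prev in outputs:
            # In-context bigram: trigger is always followed by its output.
            seq.append(outputs[prev])
        else:
            # Global bigram: sample from the shared transition table.
            seq.append(int(rng.choice(vocab, p=global_bigram[prev])))
    return seq, outputs

rng = np.random.default_rng(0)
vocab_size = 16  # assumed toy vocabulary size
# Assumed global bigram table: one random distribution per previous token.
global_bigram = rng.dirichlet(np.ones(vocab_size), size=vocab_size)
triggers = [3, 7]  # assumed trigger tokens
seq, outputs = sample_sequence(global_bigram, triggers, 64, rng)
```

A model that only memorizes `global_bigram` cannot predict the token after a trigger; it has to find the earlier occurrence of the trigger in the same sequence and copy what followed, which is the behavior an induction head provides.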

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
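The view of weight matrices as associative memories can be sketched numerically. The example below is an illustration under assumptions, not the paper's code: the dimensions, the random Gaussian embeddings, and the names `U`, `V`, `target`, and `W` are hypothetical. It stores input-to-output token pairs as a sum of outer products and recalls them by scoring every candidate output embedding, relying on the near-orthogonality of random high-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 512, 20  # assumed embedding dimension and number of stored pairs

# Random embeddings with roughly unit norm; in high dimension they are
# nearly orthogonal to one another.
U = rng.standard_normal((n_tokens, d)) / np.sqrt(d)  # input embeddings u_i
V = rng.standard_normal((n_tokens, d)) / np.sqrt(d)  # output embeddings v_j

# Association to store: input token i maps to output token target[i].
target = rng.permutation(n_tokens)

# Associative memory as a sum of outer products: W = sum_i v_{target[i]} u_i^T.
W = sum(np.outer(V[target[i]], U[i]) for i in range(n_tokens))

# Recall: scores[j, i] = v_j^T W u_i, then pick the best-scoring output per input.
scores = V @ W @ U.T
recalled = scores.argmax(axis=0)
```

Because the embeddings are only approximately orthogonal, each score carries a small interference term from the other stored pairs; with `d` much larger than `n_tokens` that term stays far below the signal, so `recalled` matches `target`.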

Code Implementations: 1 repo
