CLAICCFLLGApr 23, 2024

Transformers Can Represent $n$-gram Language Models

AI2ETH Zurich
arXiv:2404.14994v337 citationsh-index: 10NAACL
Originality Incremental advance
AI Analysis

This work addresses a foundational gap in understanding transformer language models for researchers, though it is incremental as it builds on existing formal analysis.

The paper tackled the problem of analyzing transformer language models as probability distributions over strings, showing that transformers with hard or sparse attention can exactly represent any n-gram language model, providing a concrete lower bound on their representational capacity.

Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes