CL · Oct 31, 2025

Probability Distributions Computed by Hard-Attention Transformers

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

arXiv:2510.27118v1 · 2.7 · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses a gap in the understanding of transformer expressivity for language modeling, which is the primary use case of transformers. The advance appears incremental, since it builds on existing expressivity results for recognizers.

The paper characterizes the probability distributions that transformer language models can express, showing that autoregressive and probabilistic aspects can increase expressivity and break non-probabilistic equivalences.

Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
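To make the recognizer versus language model distinction in the abstract concrete, here is a minimal Python sketch. It is not taken from the paper: the toy language (strings of a's), the `EOS` marker, and the function names `recognizer`, `next_symbol_dist`, and `string_probability` are all hypothetical, chosen only for illustration. A recognizer maps a string to accept or reject, while an autoregressive language model assigns each string the product of its next-symbol probabilities, ending with the probability of stopping.

```python
from typing import Dict

EOS = "<eos>"  # hypothetical end-of-string marker (an assumption, not from the paper)


def recognizer(w: str) -> bool:
    """Toy recognizer: accepts exactly the strings a^n with n >= 0."""
    return all(ch == "a" for ch in w)


def next_symbol_dist(prefix: str) -> Dict[str, float]:
    """Toy autoregressive model: after any prefix, emit 'a' or stop, each with probability 0.5."""
    return {"a": 0.5, EOS: 0.5}


def string_probability(w: str) -> float:
    """Probability of w: product of next-symbol probabilities, then the probability of EOS."""
    p = 1.0
    for t, ch in enumerate(w):
        # Symbols the model never emits (here, anything other than 'a') get probability 0.
        p *= next_symbol_dist(w[:t]).get(ch, 0.0)
    return p * next_symbol_dist(w).get(EOS, 0.0)
```

In this toy setup the recognizer and the language model agree on which strings are possible (those with nonzero probability), but the language model additionally fixes how probability mass is spread over them, which is the extra structure the paper's results are about.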
