CL
Oct 31, 2025

Probability Distributions Computed by Hard-Attention Transformers

arXiv:2510.27118v1
h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses a gap in the understanding of transformer expressivity for language modeling, which is their primary use case. The advance appears incremental, since it builds on existing results for recognizers.

The paper characterizes the probability distributions that transformer language models can express, showing that autoregressive and probabilistic aspects can increase expressivity and break non-probabilistic equivalences.

Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
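To make the recognizer versus language-model distinction concrete, here is a minimal sketch that does not use the paper's constructions. It assumes a toy one-symbol alphabet, a hypothetical end-of-string marker `EOS`, and a hand-written next-symbol distribution standing in for a model. A recognizer returns accept or reject. An autoregressive language model assigns each string the product of its next-symbol probabilities, including the probability of stopping.

```python
from fractions import Fraction

EOS = "<eos>"  # hypothetical end-of-string marker, not taken from the paper


def recognizer(string):
    """Toy recognizer: accept exactly the strings made only of 'a'."""
    return all(ch == "a" for ch in string)


def next_symbol_probs(prefix):
    """Toy next-symbol distribution (an assumption for illustration).

    After any prefix, emit 'a' or stop with probability 1/2 each.
    A real language model would compute this from the prefix.
    """
    return {"a": Fraction(1, 2), EOS: Fraction(1, 2)}


def string_probability(string):
    """Autoregressive probability: product of next-symbol probabilities,
    times the probability of emitting EOS after the full string."""
    prob = Fraction(1)
    for t, ch in enumerate(string):
        prob *= next_symbol_probs(string[:t]).get(ch, Fraction(0))
    prob *= next_symbol_probs(string).get(EOS, Fraction(0))
    return prob


if __name__ == "__main__":
    for s in ["", "a", "aa", "ab"]:
        print(repr(s), recognizer(s), string_probability(s))
```

In this toy case the recognizer and the language model agree on the support: the accepted strings are exactly those with nonzero probability. The language model additionally fixes how the probability mass is spread over them.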
