CL | Jun 30, 2023

Should you marginalize over possible tokenizations?

arXiv:2306.17757v1229 citations | h-index: 24
Originality: Synthesis-oriented
AI Analysis

This paper addresses a foundational issue in language modeling that matters to both researchers and practitioners, but its findings are incremental: they largely confirm existing practice, with minor exceptions.

The paper investigates whether ignoring the marginalization over possible tokenizations when computing string probabilities in autoregressive language models is justified, finding that the log-likelihood gap is typically no larger than 0.5% but increases for data with long complex words.

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
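To make the quantities in the abstract concrete, here is a minimal Python sketch on a toy problem. It is not the authors' algorithm: the unigram "LM" in `TOKEN_P`, the greedy longest-match `default_tokenization`, and the uniform left-to-right proposal are all assumptions chosen so that the exact marginal can be enumerated and compared against an importance-sampling estimate.

```python
import math
import random

# Hypothetical toy unigram "LM" (an assumption, not a model from the paper).
# Token probabilities sum to 1; "<eos>" ends a sequence, so token sequences
# form a proper probability distribution.
TOKEN_P = {"a": 0.3, "b": 0.3, "ab": 0.2, "<eos>": 0.2}


def seq_logp(tokens):
    """Log-probability of one token sequence followed by <eos>."""
    return sum(math.log(TOKEN_P[t]) for t in tokens) + math.log(TOKEN_P["<eos>"])


def candidates(rest):
    """Vocabulary tokens that are a prefix of the remaining string."""
    return [t for t in TOKEN_P if t != "<eos>" and rest.startswith(t)]


def all_tokenizations(s):
    """Enumerate every segmentation of s into vocabulary tokens."""
    if not s:
        return [[]]
    return [[t] + tail
            for t in candidates(s)
            for tail in all_tokenizations(s[len(t):])]


def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def exact_marginal_logp(s):
    """Log of the sum of sequence probabilities over all tokenizations of s."""
    return logsumexp([seq_logp(t) for t in all_tokenizations(s)])


def default_tokenization(s):
    """Greedy longest-match, a stand-in for a real tokenizer's default output."""
    tokens = []
    while s:
        t = max(candidates(s), key=len)
        tokens.append(t)
        s = s[len(t):]
    return tokens


def sample_from_proposal(s, rng):
    """Sample a tokenization left to right, uniformly over matching tokens.

    Every single character is in the toy vocabulary, so sampling never
    reaches a dead end and every tokenization has non-zero proposal mass.
    """
    tokens, log_q = [], 0.0
    while s:
        options = candidates(s)
        t = rng.choice(options)
        log_q -= math.log(len(options))
        tokens.append(t)
        s = s[len(t):]
    return tokens, log_q


def importance_sampling_logp(s, n_samples=20000, seed=0):
    """Estimate the marginal as the mean of the weights p(T) / q(T)."""
    rng = random.Random(seed)
    log_weights = []
    for _ in range(n_samples):
        tokens, log_q = sample_from_proposal(s, rng)
        log_weights.append(seq_logp(tokens) - log_q)
    return logsumexp(log_weights) - math.log(n_samples)


if __name__ == "__main__":
    s = "abab"
    print("default  :", seq_logp(default_tokenization(s)))
    print("marginal :", exact_marginal_logp(s))
    print("IS est.  :", importance_sampling_logp(s))
```

The marginal is never smaller than the default-tokenization score, because the default tokenization is one term of the sum. In this toy setup the gap is large because the vocabulary is tiny and the model is a unigram; it says nothing about the size of the gap in real models, which is what the paper measures.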

Code Implementations: 1 repo