CL IT Jun 29, 2023

Tokenization and the Noiseless Channel

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell

ETH Zurich

arXiv:2306.16842v128.3248 citations h-index: 40 Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for better tokenizer selection in NLP pipelines by offering a principled, information-theoretic approach. The advance is incremental because it builds on existing tokenization methods.

The paper asks why certain subword tokenizers improve downstream NLP model performance. It proposes that efficient tokenizers optimize channel usage, as measured by Rényi entropy, and reports a strong correlation (0.78) between Rényi entropy at α = 2.5 and BLEU scores in machine translation.

Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with α = 2.5 has a very strong correlation with BLEU: 0.78 in comparison to just -0.32 for compressed length.
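The efficiency measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard Rényi entropy definition, H_α(p) = log(Σ p_i^α) / (1 − α), and it assumes that the maximum possible entropy is the log of the number of distinct token types observed in the sample (a real vocabulary may be larger). The function names are hypothetical.

```python
import math
from collections import Counter


def token_probabilities(tokens):
    """Empirical unigram distribution over the observed token types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """Renyi entropy in nats; falls back to Shannon entropy at alpha == 1."""
    if alpha == 1:
        return shannon_entropy(probs)
    return math.log(sum(p ** alpha for p in probs)) / (1 - alpha)


def shannon_efficiency(tokens):
    """Shannon entropy divided by the maximum entropy, log(number of types).

    Assumption: the vocabulary is taken to be the set of observed types.
    """
    probs = token_probabilities(tokens)
    if len(probs) < 2:
        return 0.0  # max entropy is log(1) = 0, so the ratio is undefined
    return shannon_entropy(probs) / math.log(len(probs))


def renyi_efficiency(tokens, alpha=2.5):
    """Renyi entropy divided by the maximum entropy, log(number of types).

    alpha=2.5 is the value the abstract reports as correlating with BLEU.
    """
    probs = token_probabilities(tokens)
    if len(probs) < 2:
        return 0.0  # same degenerate case as above
    return renyi_entropy(probs, alpha) / math.log(len(probs))


if __name__ == "__main__":
    balanced = ["a", "b", "c", "d"] * 25
    skewed = ["a"] * 97 + ["b", "c", "d"]
    for name, sample in [("balanced", balanced), ("skewed", skewed)]:
        print(name, shannon_efficiency(sample), renyi_efficiency(sample))
```

Under these assumptions, a uniform token distribution scores an efficiency of 1 under both measures, while a distribution dominated by one very frequent token scores lower, and lower still under the Rényi measure with α = 2.5 than under the Shannon measure.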
