CLAILGAug 6, 2025

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

arXiv:2508.04796v29 citationsh-index: 12
Originality Incremental advance
AI Analysis

This addresses fairness issues in NLP for users of lower-resource languages, though it is an incremental improvement over existing BPE methods.

The paper tackled the problem of cross-lingual unfairness in tokenization, where standard algorithms favor dominant languages, and introduced Parity-aware Byte Pair Encoding (BPE) to improve fairness by prioritizing the worst-compressed language at each merge step, resulting in more equitable token counts with negligible impact on global compression and downstream performance.

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes