CL | Aug 21, 2024

Distributional Properties of Subword Regularization

ETH Zurich
arXiv:2408.11443v1 | 25 citations | h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses a limitation of stochastic tokenization in NLP that is relevant to both researchers and practitioners, and offers an incremental improvement in model performance.

The paper addresses biased tokenization distributions in the stochastic dropout variants of BPE and MaxMatch, showing that they favor a small set of tokenizations per word. It proposes a uniform sampling algorithm that, used as a drop-in replacement for the stochastic step of these tokenizers, improved machine translation quality.

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
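
To make the idea of uniform sampling concrete, here is a minimal sketch of one standard way to draw a tokenization of a word uniformly at random from all segmentations into vocabulary items. It is not taken from the paper and may differ from the authors' algorithm: it counts the segmentations of every suffix with dynamic programming, then picks each token with probability proportional to the number of ways the rest of the word can be completed. The function names and the toy vocabulary in the usage note are assumptions made for illustration.

```python
import random


def count_suffix_tokenizations(word, vocab):
    """counts[i] = number of ways to split word[i:] into tokens from vocab."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) tokenization
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts


def sample_uniform_tokenization(word, vocab, rng=random):
    """Draw one tokenization of `word` uniformly from all valid ones.

    Illustrative sketch only (hypothetical helper, not the paper's code).
    """
    counts = count_suffix_tokenizations(word, vocab)
    if counts[0] == 0:
        raise ValueError(f"{word!r} cannot be tokenized with this vocabulary")
    tokens = []
    i, n = 0, len(word)
    while i < n:
        # Candidate end positions whose remainder can still be completed.
        ends = [j for j in range(i + 1, n + 1)
                if word[i:j] in vocab and counts[j] > 0]
        # Weight each candidate by the number of completions it allows,
        # so every full tokenization has probability 1 / counts[0].
        weights = [counts[j] for j in ends]
        j = rng.choices(ends, weights=weights, k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens
```

For example, with the toy vocabulary `{"a", "b", "c", "ab", "bc", "abc"}`, the word `abc` has four tokenizations, and repeated calls to `sample_uniform_tokenization("abc", vocab)` should return each of them about equally often. A dropout-based sampler, by contrast, concentrates probability on a few of them, which is the bias the abstract describes.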

