CL OCJun 29, 2023

A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell

ETH Zurich

arXiv:2306.16837v327.5234 citationsh-index: 40Has Code

Originality Highly original

AI Analysis

This work addresses the lack of formal understanding and efficiency in BPE, a widely used tokenization method in NLP, offering theoretical insights and practical improvements.

The paper formalizes Byte-Pair Encoding (BPE) as a combinatorial optimization problem, proving that its greedy version approximates an optimal merge sequence with a theoretical bound of about 0.37 and providing a faster implementation that reduces runtime from O(NM) to O(N log M).

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{σ(\boldsymbolμ^\star)}(1-e^{-{σ(\boldsymbolμ^\star)}})$-approximation of an optimal merge sequence, where ${σ(\boldsymbolμ^\star)}$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbolμ^\star$. Empirically the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}\left(N M\right)$ to $\mathcal{O}\left(N \log M\right)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.

View on arXiv PDF Code

Similar