CL, DC | Jul 16, 2025

BlockBPE: Parallel BPE Tokenization

arXiv:2507.11941v1 | 9.64 citations | h-index: 1

Originality: Incremental advance

AI Analysis

BlockBPE addresses a bottleneck in GPU batch inference workflows for large language models, though the advance is incremental: it builds on existing BPE methods and adds optimizations.

The paper tackles CPU-bound tokenization in large language model pipelines by presenting BlockBPE, a parallel GPU implementation of byte-pair encoding that achieves up to 2x higher throughput than tiktoken and 2.5x higher throughput than HuggingFace Tokenizers on high-batch inference workloads.

Tokenization is a critical preprocessing step in large language model pipelines, yet widely used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput batch inference. Existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI's tiktoken have runtimes that are dominated by Regex pre-tokenization and exhibit $O(n \log n)$ runtime. BlockBPE instead eliminates the Regex pre-tokenization, which leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to $O(nd)$, where $d \ll n$. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x higher throughput than HuggingFace Tokenizers.
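To make the $O(nd)$ claim concrete, the following is a minimal CPU sketch of pass-based BPE merging without Regex pre-tokenization. It is not the paper's GPU kernel: the names `merge_pass`, `encode`, and `DEMO_MERGES`, the merge-table layout, and the reading of $d$ as the number of merge passes are all assumptions made for illustration. Each pass does $O(n)$ work over the sequence, so $d$ passes cost $O(nd)$; the comments note which steps would plausibly map to data-parallel primitives inside a thread block.

```python
from typing import Dict, List, Tuple

# Hypothetical merge table: (left_id, right_id) -> (rank, merged_id).
# Lower rank means the merge was learned earlier and is applied first.
MergeTable = Dict[Tuple[int, int], Tuple[int, int]]

INF = float("inf")


def merge_pass(tokens: List[int], merges: MergeTable) -> Tuple[List[int], bool]:
    """Apply the single best-ranked merge everywhere; return (tokens, changed)."""
    # Step 1: rank every adjacent pair (an independent lookup per position,
    # i.e. a parallel map on a GPU).
    ranks = [
        merges.get((tokens[i], tokens[i + 1]), (INF, -1))[0]
        for i in range(len(tokens) - 1)
    ]
    # Step 2: find the lowest rank (a min-reduction on a GPU).
    best = min(ranks, default=INF)
    if best == INF:
        return tokens, False
    # Step 3: merge all non-overlapping occurrences left to right and compact
    # the sequence (on a GPU this would typically use a prefix sum).
    out: List[int] = []
    i = 0
    while i < len(tokens):
        if i < len(ranks) and ranks[i] == best:
            out.append(merges[(tokens[i], tokens[i + 1])][1])
            i += 2  # skip the right half of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out, True


def encode(text: str, merges: MergeTable) -> Tuple[List[int], int]:
    """Encode raw UTF-8 bytes; return (token_ids, number_of_merge_passes)."""
    tokens = list(text.encode("utf-8"))
    passes = 0
    while True:
        tokens, changed = merge_pass(tokens, merges)
        if not changed:
            return tokens, passes
        passes += 1


# Toy vocabulary (illustrative only): "ab" -> 256, then (256, "c") -> 257.
DEMO_MERGES: MergeTable = {
    (ord("a"), ord("b")): (0, 256),
    (256, ord("c")): (1, 257),
}

if __name__ == "__main__":
    print(encode("abcab", DEMO_MERGES))
```

In this sketch the loop in `encode` runs once per distinct merge that actually applies to the input, which is why the pass count stays small relative to the sequence length when the input is short compared with the number of merge rules it triggers.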
