LG AI Feb 15

You Can Learn Tokenization End-to-End with Reinforcement Learning

arXiv:2602.13940v1
Originality: Highly original
AI Analysis

This work addresses the bottleneck of hardcoded tokenization in LLMs and offers researchers and practitioners a more theoretically grounded approach, though it builds incrementally on existing methods.

The paper tackles the problem of learning token boundaries end-to-end in LLMs with reinforcement learning techniques, specifically score function estimates with time discounting, and demonstrates that this method outperforms prior straight-through estimates at the 100-million-parameter scale.

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.
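To make the abstract's central idea concrete, the following is a minimal sketch of a score function (REINFORCE-style) gradient estimate for discrete boundary decisions with time discounting. It is not the paper's implementation: the Bernoulli boundary model, the per-position losses, the discount factor `gamma`, and the function names `discounted_loss_to_go` and `score_function_grad` are all assumptions made for illustration.

```python
import numpy as np


def discounted_loss_to_go(losses, gamma):
    """Return G with G[t] = sum over k >= t of gamma**(k - t) * losses[k]."""
    out = np.zeros_like(losses, dtype=float)
    running = 0.0
    # Accumulate from the last position backwards.
    for t in range(len(losses) - 1, -1, -1):
        running = losses[t] + gamma * running
        out[t] = running
    return out


def score_function_grad(logits, boundaries, losses, gamma=0.9):
    """Single-sample score function estimate of d(loss)/d(logits).

    Assumption (hypothetical setup): boundary b[t] ~ Bernoulli(sigmoid(logits[t])),
    and each decision is credited only with the discounted losses from
    position t onwards.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Gradient of log Bernoulli(b; sigmoid(logit)) with respect to the logit.
    score = boundaries - probs
    return score * discounted_loss_to_go(losses, gamma)


# Usage: sample boundaries, then estimate the gradient from observed losses.
rng = np.random.default_rng(0)
logits = np.zeros(5)
probs = 1.0 / (1.0 + np.exp(-logits))
boundaries = (rng.random(5) < probs).astype(float)
losses = np.ones(5)  # placeholder per-position losses
grad = score_function_grad(logits, boundaries, losses, gamma=0.9)
```

With `gamma` below 1, losses far from a boundary decision contribute less to its gradient, which is one plausible way discounting can lower the variance of the estimate; unlike a straight-through estimate, the boundary samples stay discrete throughout.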

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
