CL · AI · Jan 14

SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding

arXiv:2601.09089v1 · 0.6 · h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses a practical issue for developers and users of LLMs in real-world applications such as text-based navigation, though the contribution is incremental because it builds on existing benchmarks.

The paper tackles the problem that large language models struggle with character-level tasks because of tokenization. It introduces SubTokenTest, a benchmark for practical sub-token understanding, and finds that advanced LLMs perform poorly on these tasks, reporting specific failure rates across nine models.

Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed as lacking practical relevance. Yet many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.
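
To make the tokenization issue mentioned in the abstract concrete, here is a minimal sketch of why letter counting is easy on raw characters but not directly visible at the token level. The vocabulary `TOY_VOCAB` and the greedy longest-match function `toy_tokenize` are hypothetical, invented for illustration only; they are not the tokenizer of any real model and are not taken from SubTokenTest.

```python
# Toy illustration only: TOY_VOCAB and toy_tokenize are hypothetical and do not
# correspond to any real model's tokenizer or to the SubTokenTest benchmark.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def toy_tokenize(word: str) -> list[str]:
    """Segment a word by greedy longest match against TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens


def count_char(word: str, ch: str) -> int:
    """Character-level ground truth: trivial when characters are visible."""
    return word.count(ch)


word = "strawberry"
tokens = toy_tokenize(word)                  # multi-character pieces
token_ids = [TOY_VOCAB[t] for t in tokens]   # what a model would receive

print(tokens, token_ids, count_char(word, "r"))
```

In this toy setup the word arrives as two opaque IDs, so the number of occurrences of "r" is not directly readable from the input; it would have to be recovered from whatever each token's learned representation encodes about its spelling.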
