CL | May 20, 2025

The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

arXiv:2505.14172v3 | 13 citations | h-index: 14 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses a fundamental limitation of tokenized language models, namely their poor handling of low-level perceptual tasks, and offers an incremental improvement for AI systems that require character-level understanding.

The paper tackles the problem of tokenized language models failing at simple character-level tasks like counting letters, showing that such capabilities emerge suddenly and late in training, and proposes a lightweight architectural modification that significantly improves character-level reasoning while preserving the advantages of subword models.

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge suddenly and only late in training. We find that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
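To make the tokenization limitation concrete, here is a minimal sketch of why letter counting is awkward for a subword model. It is not the paper's code or one of its 19 tasks. The vocabulary `TOY_VOCAB` and the helpers `tokenize` and `count_letter_from_ids` are hypothetical names made up for illustration, and greedy longest-match segmentation stands in for a learned subword tokenizer.

```python
# Hypothetical toy vocabulary (an assumption for illustration);
# real subword vocabularies are learned from data.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(word, vocab):
    """Greedy longest-match segmentation of a word into token IDs."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word[i:]!r}")
    return ids


def count_letter_from_ids(ids, letter, vocab):
    """Count a letter from token IDs by looking up each token's spelling.

    The ID-to-string table used here is exactly the information a model
    is not given explicitly: it only receives the opaque IDs.
    """
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return sum(id_to_piece[idx].count(letter) for idx in ids)


ids = tokenize("strawberry", TOY_VOCAB)
print(ids)                                         # [0, 1]
print(count_letter_from_ids(ids, "r", TOY_VOCAB))  # 3
```

In this toy setting the word reaches the model as two opaque IDs, so answering "how many r's?" requires knowing the spelling of each token, which the model must learn indirectly from data.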

Code Implementations: 1 repo
