CL · Jun 12, 2025

Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

arXiv:2506.10641v1 · 4 citations · h-index: 2 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation in how LLMs access the character-level content of their tokens. The advance is incremental, since it builds on existing understanding of model internals.

The study investigates how large language models (LLMs) handle character-level information when spelling out tokens. It finds that they struggle with more complex tasks, such as identifying subcomponents within tokens, and that they reconstruct spellings in intermediate and higher Transformer layers rather than reading them directly from the embedding layer.

Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
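To make the probing idea in the abstract concrete, here is a minimal sketch of a character probe on toy data. It is not the authors' setup: the "embeddings" are synthetic vectors deliberately constructed (an assumption) to encode only the first character of each token, and a nearest-centroid classifier stands in for the paper's probing classifiers. All function names (`toy_embeddings`, `fit_centroids`, `probe_accuracy`) are hypothetical. It only illustrates how a probe can show that one character position is decodable from a representation while another is not.

```python
import itertools
import random


def toy_embeddings(tokens, dim=16, noise=0.1, seed=0):
    """Synthetic vectors that encode ONLY the first character (toy assumption)."""
    rng = random.Random(seed)
    alphabet = sorted({ch for tok in tokens for ch in tok})
    # One random direction per character; a token's vector is the direction
    # of its first character plus a little Gaussian noise.
    direction = {ch: [rng.gauss(0, 1) for _ in range(dim)] for ch in alphabet}
    return {
        tok: [x + rng.gauss(0, noise) for x in direction[tok[0]]]
        for tok in tokens
    }


def fit_centroids(vectors, labels):
    """Nearest-centroid 'probe': store the mean vector of each label."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in s] for lab, s in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def probe_accuracy(emb, train, test, position):
    """Train a probe for the character at `position`, then score it on held-out tokens."""
    centroids = fit_centroids(
        [emb[t] for t in train], [t[position] for t in train]
    )
    hits = sum(predict(centroids, emb[t]) == t[position] for t in test)
    return hits / len(test)


# All 64 three-letter "tokens" over a four-letter alphabet, split 48/16.
tokens = ["".join(p) for p in itertools.product("abcd", repeat=3)]
random.Random(1).shuffle(tokens)
train, test = tokens[:48], tokens[48:]

emb = toy_embeddings(tokens)
acc_first = probe_accuracy(emb, train, test, position=0)
acc_second = probe_accuracy(emb, train, test, position=1)
print(f"probe accuracy, 1st character: {acc_first:.2f}")
print(f"probe accuracy, 2nd character: {acc_second:.2f}")
```

On this toy data the first-character probe should be close to perfect and the second-character probe close to chance, by construction. In a real study the vectors would come from a model's embedding layer or from the hidden states of a given layer, and the same probe would be repeated per layer to locate where spelling information becomes decodable.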

