Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
This paper addresses a specific bottleneck in LLMs on tasks that require character-level understanding, such as Chinese Spelling Correction, but its contribution is an incremental improvement to existing tokenization approaches.
The paper tackles the problem that tokenization methods like BPE obscure the character structure inside tokens, which hinders the precise character position prediction that tasks like Chinese Spelling Correction depend on. The authors report that their proposed TIPA method significantly improves position prediction accuracy and boosts performance on character-level tasks.
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs' ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models' ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer's vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
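The reverse character prediction task described in the abstract can be illustrated with a minimal sketch. It assumes each vocabulary token becomes one training example whose target lists the token's characters from the last position to the first, keyed by 1-based position and serialized as JSON. The target format, the function names `reverse_position_target` and `build_tipa_examples`, and the toy vocabulary are all hypothetical choices made for illustration and are not taken from the paper. A real tokenizer vocabulary would also need byte-level tokens decoded to text first.

```python
import json


def reverse_position_target(token: str) -> str:
    """Serialize a token's characters keyed by 1-based position, last position first.

    Assumption: a JSON object is used as the target format for illustration only.
    """
    n = len(token)
    # Walk the token from its last character to its first, so the first key
    # emitted is the token length and the last key emitted is 1.
    mapping = {n - i: ch for i, ch in enumerate(reversed(token))}
    return json.dumps(mapping, ensure_ascii=False)


def build_tipa_examples(vocab):
    """Build one (prompt, target) pair per non-blank token in the vocabulary."""
    examples = []
    for token in vocab:
        if not token.strip():
            # Skip whitespace-only entries; a hypothetical filtering choice.
            continue
        examples.append({"prompt": token, "target": reverse_position_target(token)})
    return examples


# Toy vocabulary standing in for a real tokenizer's vocabulary (hypothetical).
toy_vocab = ["小说", "cat", " "]
examples = build_tipa_examples(toy_vocab)

for example in examples:
    print(example["prompt"], "->", example["target"])
```

Because the first key in each target equals the token's length, a model trained on pairs like these would have to produce the character count before any character, which is one plausible reading of why the prediction runs in reverse.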