Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling
This work examines whether visual information can be used for Chinese language modeling, as an alternative to traditional index-based tokenization.
The paper represents Chinese characters in language models with low-resolution visual tokens instead of discrete index-based tokens. These inputs reach 39.2% accuracy, comparable to the index-based baseline of 39.1%, and show a hot-start effect: accuracy exceeds 12% by 0.4% of total training, while index-based models remain below 6%.
Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as $8 \times 8$ pixels. These inputs achieve 39.2\% accuracy, comparable to the index-based baseline of 39.1\%. Such low-resource settings also exhibit a pronounced \emph{hot-start} effect: by 0.4\% of total training, accuracy exceeds 12\%, while index-based models remain below 6\%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering a perspective on character representation that complements traditional index-based approaches.
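To make the input format concrete, the following is a minimal sketch of how a character image might be reduced to an $8 \times 8$ grayscale input and flattened into a 64-dimensional vector. It is an illustration under stated assumptions, not the pipeline used in this work: the abstract does not specify how glyphs are rendered or how the decoder consumes them, so the average-pooling step, the function names `downsample` and `flatten`, and the toy 16 x 16 bitmap are all hypothetical.

```python
from typing import List

Bitmap = List[List[float]]  # pixel intensities in [0, 1], row-major


def downsample(glyph: Bitmap, out_size: int = 8) -> Bitmap:
    """Average-pool a square bitmap down to out_size x out_size pixels.

    Assumption: the glyph side length is a multiple of out_size.
    """
    n = len(glyph)
    if any(len(row) != n for row in glyph):
        raise ValueError("glyph must be square")
    if n % out_size != 0:
        raise ValueError("glyph side must be a multiple of out_size")
    k = n // out_size  # side length of each pooling block
    pooled: Bitmap = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            block = [glyph[r * k + i][c * k + j]
                     for i in range(k) for j in range(k)]
            row.append(sum(block) / (k * k))
        pooled.append(row)
    return pooled


def flatten(image: Bitmap) -> List[float]:
    """Flatten a 2D image into a 1D vector, row by row."""
    return [pixel for row in image for pixel in row]


# Hypothetical stand-in for a rendered character: a 16 x 16 bitmap
# with a filled square in the top-left quadrant.
toy_glyph = [[1.0 if r < 8 and c < 8 else 0.0 for c in range(16)]
             for r in range(16)]

# The 64-dimensional vector that would replace a token ID as input.
visual_token = flatten(downsample(toy_glyph))
```

In a real model, a vector like `visual_token` would presumably pass through a learned projection before entering the decoder. That step is omitted here because the abstract does not describe it.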