CL LG Mar 13

Token Distillation: Attention-aware Input Embeddings For New Tokens

arXiv:2505.20133
45.43 citations
AI Analysis

The paper addresses vocabulary inflexibility in language models for domains whose terms are poorly covered by the original vocabulary. The contribution appears incremental, since it builds on existing embedding initialization methods.

The paper targets a known weakness of static vocabularies: domains underrepresented in the original vocabulary suffer lower performance and higher computational cost. It proposes Token Distillation, which quickly learns high-quality input embeddings for new tokens, and reports that the method outperforms strong baselines across a range of open-weight models.

Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. New tokens can be added to solve this problem, when coupled with a good initialization for their new embeddings. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
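The core idea in the abstract, learning one new input embedding by matching representations produced under the original tokenization, can be illustrated with a toy sketch. This is not the paper's implementation: the single attention-style `last_hidden` function stands in for a frozen language model, the mean-of-pieces starting point is a common heuristic rather than something the abstract specifies, and all names, sizes, and the finite-difference optimizer are assumptions chosen to keep the example dependency-free.

```python
import math
import random

random.seed(0)
DIM = 4  # toy embedding size (assumption)

def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def last_hidden(seq):
    """Toy frozen 'model': the last position attends over the whole sequence."""
    query = seq[-1]
    scores = [dot(query, key) / math.sqrt(DIM) for key in seq]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * vec[i] for w, vec in zip(weights, seq))
            for i in range(DIM)]

def loss(emb, contexts, targets):
    """Mean squared error between student and teacher hidden states."""
    err = 0.0
    for ctx, tgt in zip(contexts, targets):
        student = last_hidden(ctx + [emb])  # new token as a single position
        err += sum((s - t) ** 2 for s, t in zip(student, tgt))
    return err / len(contexts)

def numeric_grad(emb, contexts, targets, eps=1e-5):
    """Central finite-difference gradient with respect to the new embedding."""
    grad = []
    for i in range(len(emb)):
        up, down = emb[:], emb[:]
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, contexts, targets)
                     - loss(down, contexts, targets)) / (2 * eps))
    return grad

# Hypothetical new token that the original tokenizer splits into 3 pieces.
subtokens = [rand_vec() for _ in range(3)]
contexts = [[rand_vec() for _ in range(2)] for _ in range(8)]

# Teacher targets: the frozen model run on the original multi-piece tokenization.
targets = [last_hidden(ctx + subtokens) for ctx in contexts]

# Start from the mean of the sub-token embeddings (a common heuristic).
new_emb = [sum(col) / len(subtokens) for col in zip(*subtokens)]
initial_loss = loss(new_emb, contexts, targets)

# Gradient descent with backtracking; only the new embedding is updated.
step = 0.5
final_loss = initial_loss
for _ in range(200):
    grad = numeric_grad(new_emb, contexts, targets)
    candidate = [e - step * g for e, g in zip(new_emb, grad)]
    cand_loss = loss(candidate, contexts, targets)
    if cand_loss < final_loss:
        new_emb, final_loss = candidate, cand_loss
    else:
        step *= 0.5  # step was too large, shrink it and retry

print("loss at mean init:", initial_loss)
print("loss after distillation:", final_loss)
```

The model weights never change in this sketch; only the single new vector is trained, which is what keeps this style of approach cheap compared with further training of the whole model. In a real setting the frozen function would be an actual language model and the gradient would come from automatic differentiation rather than finite differences.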

