Incorporating Context into Subword Vocabularies
The work addresses the mismatch between tokenizers trained without context and the contextual language models that consume their vocabularies, giving NLP practitioners a more cohesive approach. The contribution is incremental, since it builds on existing tokenizer frameworks.
The paper observes that subword tokenizers ignore contextual information during vocabulary creation and presents SaGe, a tokenizer that incorporates contextual signals at that stage. SaGe improves performance on English GLUE classification and NER, and on inference and NER in Turkish.
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
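To make the first claim of the abstract concrete, the following minimal sketch computes the kind of statistic a frequency-based tokenizer trainer relies on: counts of adjacent symbol pairs inside words, weighted by word frequency, in the style of BPE. It is not SaGe's algorithm, which the abstract does not specify; the function name `pair_counts` and the toy corpus are assumptions made for illustration. The point is that shuffling every word in the corpus, which destroys all context, leaves the statistic unchanged.

```python
from collections import Counter
import random


def pair_counts(corpus):
    """Count adjacent symbol pairs inside each word, weighted by word frequency.

    Hypothetical helper for illustration: a BPE-style statistic that looks
    only at characters within a word, never at neighbouring words.
    """
    word_freq = Counter(word for sentence in corpus for word in sentence.split())
    pairs = Counter()
    for word, freq in word_freq.items():
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


# Toy corpus (an assumption, not data from the paper).
corpus = ["the lower bank of the river", "the bank lowered the rate"]
original = pair_counts(corpus)

# Destroy all context by shuffling the words of the whole corpus.
words = " ".join(corpus).split()
random.Random(0).shuffle(words)
shuffled = pair_counts([" ".join(words)])

# The frequency statistic cannot tell the two corpora apart.
print(original == shuffled)
```

Because the counts are identical before and after shuffling, a vocabulary built from them alone cannot reflect how tokens are used in context, which is the gap the abstract says SaGe targets.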