CLAIOct 5, 2023

Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

arXiv:2310.03477v18 citationsh-index: 19
Originality Incremental advance
AI Analysis

This incremental approach benefits low- and mid-resource languages by reducing data and time needed for training state-of-the-art models.

The paper tackles the challenge of training monolingual language models for low- and mid-resource languages by proposing a model conversion strategy that adapts high-resource models using a word translation dictionary, achieving new state-of-the-art performance on Dutch and Frisian across downstream tasks.

Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resources monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer. This one-to-many token mapping improves tremendously the initialization of the embedding table for the target language. We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian. These converted models achieve a new state-of-the-art performance on these languages across all sorts of downstream tasks. By reducing significantly the amount of data and time required for training state-of-the-art models, our novel model conversion strategy has the potential to benefit many languages worldwide.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes