CL, May 23, 2023

FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models

arXiv:2305.14481v2, 140 citations
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck for low-resource language modeling by enabling effective embedding transfer with new tokenizers, though it is an incremental improvement over existing initialization techniques.

The paper tackles the problem of initializing embeddings when a new tokenizer is used for monolingual specialization of a multilingual model. It proposes FOCUS, which represents each new token as a combination of overlapping tokens chosen by semantic similarity. The results show that FOCUS outperforms random initialization and previous methods in language modeling and on downstream tasks (NLI, QA, and NER).

Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model's embedding matrix. In this paper, we propose FOCUS - Fast Overlapping Token Combinations Using Sparsemax, a novel embedding initialization method that initializes the embedding matrix effectively for a new tokenizer based on information in the source model's embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work in language modeling and on a range of downstream tasks (NLI, QA, and NER).
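The initialization idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity as the similarity measure in the auxiliary space, plain dictionaries for the vocabularies, and an auxiliary vector for every target token. The names `sparsemax`, `init_target_embeddings`, `source_vocab`, `target_vocab`, and `aux_emb` are hypothetical and chosen only for this example.

```python
import numpy as np


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so only a few
    entries receive non-zero weight.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # entries that stay non-zero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max    # threshold
    return np.maximum(z - tau, 0.0)


def init_target_embeddings(source_vocab, source_emb, target_vocab, aux_emb):
    """Sketch of overlap-based embedding initialization.

    source_vocab: dict mapping token -> row index in source_emb
    source_emb:   array of shape (source vocab size, dim)
    target_vocab: list of tokens from the new tokenizer
    aux_emb:      dict mapping each target token -> auxiliary static vector
                  (assumption: available for all target tokens)
    """
    source_emb = np.asarray(source_emb, dtype=float)
    overlap = [t for t in target_vocab if t in source_vocab]

    # Unit-normalize auxiliary vectors so dot products are cosine similarities.
    aux_overlap = np.stack([np.asarray(aux_emb[t], dtype=float) for t in overlap])
    aux_overlap = aux_overlap / np.linalg.norm(aux_overlap, axis=1, keepdims=True)
    src_overlap = source_emb[[source_vocab[t] for t in overlap]]

    target_emb = np.zeros((len(target_vocab), source_emb.shape[1]))
    for i, tok in enumerate(target_vocab):
        if tok in source_vocab:
            # Overlapping token: copy the source embedding unchanged.
            target_emb[i] = source_emb[source_vocab[tok]]
        else:
            # New token: sparse weighted combination of overlapping tokens.
            q = np.asarray(aux_emb[tok], dtype=float)
            q = q / np.linalg.norm(q)
            weights = sparsemax(aux_overlap @ q)
            target_emb[i] = weights @ src_overlap
    return target_emb
```

Because sparsemax assigns exactly zero weight to dissimilar tokens, each new token is built from only a handful of overlapping tokens rather than a dense average over the whole overlap. How the auxiliary embeddings are trained, and how special tokens are handled, is left out of this sketch.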

Code Implementations: 2 repos