CL AI · Jun 17, 2024

How Can We Effectively Expand the Vocabulary of LLMs with 0.01GB of Target Language Text?

Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

arXiv:2406.11477v311.923 citations · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the higher inference costs that non-English speakers incur because of English-centric tokenizers. The contribution is incremental, since it builds on existing vocabulary expansion methods.

The paper tackles the problem of vocabulary expansion for large language models (LLMs) in low-resource settings, achieving faster inference while maintaining competitive downstream performance using only 30K sentences (~0.01GB) of target language text.

Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers and vocabulary, resulting in higher usage costs for non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in speeding up inference, previous work on vocabulary expansion has focused on high-resource settings, assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. Vocabulary expansion in low-resource settings, however, has yet to be explored. In this article, we investigate vocabulary expansion in low-resource settings by considering embedding initialization methods and continual pre-training strategies. Through extensive experiments across typologically diverse languages, tasks and models, we establish a set of strategies to perform vocabulary expansion for faster inference, while striving to maintain downstream performance competitive with baselines. This is achieved with only 30K sentences (~0.01GB of text data) from the target language.
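To make the idea of initializing embeddings for new tokens concrete, here is a minimal sketch of one common heuristic: each new target-language token gets the mean of the embeddings of the source-vocabulary pieces it previously split into. The abstract does not say which initialization methods the paper compares, so this is an illustrative assumption rather than the authors' method. The names `expand_embeddings`, `mean_vector`, and the toy matrix are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def expand_embeddings(
    emb: List[Vector], new_token_pieces: List[List[int]]
) -> List[Vector]:
    """Return a copy of `emb` with one extra row per new token.

    `new_token_pieces[k]` lists the source-vocabulary ids that the k-th new
    token was split into by the original tokenizer (an assumed input).
    A token with no known pieces falls back to the mean of all source rows.
    """
    expanded = [row[:] for row in emb]  # leave the source matrix untouched
    for piece_ids in new_token_pieces:
        source_rows = [emb[i] for i in piece_ids] if piece_ids else emb
        expanded.append(mean_vector(source_rows))
    return expanded


# Toy source embedding matrix: 3 tokens, 2 dimensions (hypothetical values).
source_emb = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
# Two new tokens: the first used to split into ids 0 and 1, the second into id 2.
new_token_pieces = [[0, 1], [2]]

expanded = expand_embeddings(source_emb, new_token_pieces)
print(len(expanded))  # 5 rows: 3 source tokens + 2 new tokens
print(expanded[3])    # [1.0, 2.0], the mean of rows 0 and 1
```

In a real model the same rows would be appended to both the input embedding matrix and the output projection, followed by continual pre-training on the target-language text; those steps are outside this sketch.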
