CL | Oct 12, 2024

Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

arXiv:2410.09644v3 | 13 citations | h-index: 36 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of adapting language models to languages with varying scripts and resource levels, and it offers a scalable solution. It is an incremental advance, since it builds on existing adapter-based methods.

The paper tackles vocabulary adaptation for pre-trained language models, which expands them to new languages and reduces token fragmentation. It proposes VocADT, which outperforms the original Mistral model and other baselines across 11 languages on multilingual tasks such as natural language understanding and machine translation.

Vocabulary adaptation, which integrates new vocabulary into pre-trained language models, enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristics or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without depending on external resources or language constraints. Across 11 languages (with diverse scripts, resource availability, and fragmentation), we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks including natural language understanding and machine translation. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation is still beneficial after fine-tuning and that VocADT is the most effective.
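
The core idea in the abstract, new-token embeddings expressed as linear combinations of the frozen original embeddings, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `combine`, the tiny 3-token vocabulary, and the hand-set adapter weights are all assumptions made for illustration. In the paper the adapter weights are learned during training, which this sketch omits.

```python
# Toy sketch (hypothetical names and values): each new-vocabulary embedding
# is a weighted sum of the original embeddings, which stay unchanged.

def combine(adapter, old_embeddings):
    """Return the matrix product adapter @ old_embeddings using plain lists."""
    dim = len(old_embeddings[0])
    return [
        [sum(w * old_embeddings[j][d] for j, w in enumerate(row)) for d in range(dim)]
        for row in adapter
    ]

# Frozen embeddings for an original 3-token vocabulary (embedding size 2).
old_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

# Adapter matrix: one row per new token, one column per original token.
# Hand-set here; in an adapter-based method these weights would be trained.
adapter = [
    [1.0, 0.0, 0.0],  # new token 0 copies original token 0
    [0.5, 0.5, 0.0],  # new token 1 averages original tokens 0 and 1
]

new_embeddings = combine(adapter, old_embeddings)
print(new_embeddings)
```

Only `adapter` would receive gradient updates in such a setup; `old_embeddings` and the rest of the model stay fixed, which is what keeps the approach lightweight.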

Code Implementations: 1 repo
