CL
Mar 24, 2025

LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

arXiv:2503.18603v2
h-index: 7
Originality: Incremental advance
AI Analysis

LANGALIGN addresses a challenge faced by developers who rely on embedding-based models in non-English contexts, offering a practical improvement over fine-tuning on English datasets alone.

The paper improves non-English language models by aligning English and target-language embeddings, and reports significant performance gains for Korean, Japanese, and Chinese models.

While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.
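To make the idea of aligning embeddings at the model/task-head interface more concrete, here is a minimal sketch. It is not the paper's method: the abstract does not specify how the alignment is learned, so this example swaps in a plain linear least-squares map fitted on paired embeddings. The function names (`fit_linear_aligner`, `align`), the embedding dimension, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_aligner(source_emb, reference_emb):
    """Fit W so that source_emb @ W approximates reference_emb (least squares).

    Hypothetical helper: a simple stand-in for an alignment step, not the
    procedure described in the paper.
    """
    W, *_ = np.linalg.lstsq(source_emb, reference_emb, rcond=None)
    return W

def align(source_emb, W):
    """Map source-language embeddings into the reference embedding space."""
    return source_emb @ W

# Synthetic stand-in for embeddings of parallel sentence pairs (assumed data).
rng = np.random.default_rng(0)
dim, n_pairs = 8, 200
target = rng.normal(size=(n_pairs, dim))    # e.g. target-language embeddings
true_map = rng.normal(size=(dim, dim))      # unknown relation between spaces
english = target @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

# Learn the map, then align target embeddings before an English-trained task head.
W = fit_linear_aligner(target, english)
err_before = np.mean((target - english) ** 2)
err_after = np.mean((align(target, W) - english) ** 2)
print(f"MSE before alignment: {err_before:.4f}, after: {err_after:.6f}")
```

In this sketch, the reverse direction mentioned in the abstract would correspond to swapping the arguments, `fit_linear_aligner(english, target)`, again only under the linear-map assumption.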

