CL | Apr 29, 2025

Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training

arXiv:2504.20484v25 citations | h-index: 28 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving multilingual capabilities in LLMs for broader linguistic and domain coverage, though it appears incremental as it builds on existing pre-training methods with a novel data construction approach.

The paper tackled the problem of limited cross-lingual transfer in large language models due to constraints in parallel resources, proposing Cross-lingual In-context Pre-training (CrossIC-PT) to enhance multilingual performance, which resulted in performance gains of up to 3.99% across models and languages.

Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy to split long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework to construct CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
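The sample construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact segmentation policy, so everything specific here is an assumption made for illustration: the function names `chunk` and `build_samples`, the `<sep>` marker, the even split of the window between the two languages, and the pairing of chunks by position. Documents are represented as plain token lists.

```python
from typing import List, Sequence


def chunk(tokens: Sequence[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into chunks of at most `size` tokens.

    Consecutive chunks share `overlap` tokens (a simple sliding window).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


def build_samples(
    doc_a: Sequence[str],
    doc_b: Sequence[str],
    window: int,
    overlap: int = 0,
    sep: str = "<sep>",
) -> List[List[str]]:
    """Interleave two related documents into samples of at most `window` tokens.

    Assumption (not from the paper): each language gets half the window, and
    the i-th chunk of one document is paired with the i-th chunk of the other.
    """
    half = (window - 1) // 2  # reserve one position for the separator
    chunks_a = chunk(doc_a, half, overlap)
    chunks_b = chunk(doc_b, half, overlap)
    samples = []
    for i in range(max(len(chunks_a), len(chunks_b))):
        part_a = chunks_a[i] if i < len(chunks_a) else []
        part_b = chunks_b[i] if i < len(chunks_b) else []
        samples.append(part_a + [sep] + part_b)
    return samples


# Hypothetical toy documents standing in for a bilingual Wikipedia pair.
doc_en = ["a0", "a1", "a2", "a3", "a4", "a5"]
doc_zh = ["b0", "b1", "b2", "b3"]
samples = build_samples(doc_en, doc_zh, window=7)
```

Each resulting sample would then be trained on with ordinary next-token prediction. Pairing chunks by position presumes the two documents cover their topic in roughly the same order, which may not hold for real article pairs.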

