CL | Apr 29, 2025

Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training

Linjuan Wu, Haoran Wei, Huan Lin, Tianhao Li, Baosong Yang, Fei Huang, Weiming Lu

arXiv:2504.20484v29.65 citations | h-index: 28 | EMNLP

Originality: Incremental advance

AI Analysis

This work aims to improve the multilingual capabilities of LLMs for broader linguistic and domain coverage. It appears incremental: it builds on existing pre-training methods and contributes a new data construction approach.

The paper tackles limited cross-lingual transfer in large language models, which existing methods leave constrained by the scarcity of parallel resources. It proposes Cross-lingual In-context Pre-training (CrossIC-PT) to enhance multilingual performance and reports gains of up to 3.99% across models and languages.

Large language models (LLMs) exhibit remarkable multilingual capabilities despite English-dominated pre-training, attributed to cross-lingual mechanisms during pre-training. Existing methods for enhancing cross-lingual transfer remain constrained by parallel resources, suffering from limited linguistic and domain coverage. We propose Cross-lingual In-context Pre-training (CrossIC-PT), a simple and scalable approach that enhances cross-lingual transfer by leveraging semantically related bilingual texts via simple next-word prediction. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. To address window size constraints, we implement a systematic segmentation policy that splits long bilingual document pairs into chunks while adjusting the sliding window mechanism to preserve contextual coherence. We further extend data availability through a semantic retrieval framework that constructs CrossIC-PT samples from a web-crawled corpus. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models (Llama-3.1-8B, Qwen2.5-7B, and Qwen2.5-1.5B) across six target languages, yielding performance gains of 3.79%, 3.99%, and 1.95%, respectively, with additional improvements after data augmentation.
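
The sample construction described in the abstract (interleaving a bilingual document pair, then chunking it to fit a context window) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the segmentation policy, so the function names (`split_segments`, `interleave`, `pack_windows`), the fixed segment length, the whitespace tokenization, and the one-segment overlap between windows are all assumptions made for illustration.

```python
from typing import List


def split_segments(doc: str, seg_len: int) -> List[List[str]]:
    """Split a document into consecutive segments of at most seg_len tokens.

    Assumption: whitespace splitting stands in for a real subword tokenizer.
    """
    tokens = doc.split()
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def interleave(src_doc: str, tgt_doc: str, seg_len: int) -> List[List[str]]:
    """Alternate segments of the two documents: src_0, tgt_0, src_1, tgt_1, ..."""
    src = split_segments(src_doc, seg_len)
    tgt = split_segments(tgt_doc, seg_len)
    mixed: List[List[str]] = []
    for i in range(max(len(src), len(tgt))):
        if i < len(src):
            mixed.append(src[i])
        if i < len(tgt):
            mixed.append(tgt[i])
    return mixed


def pack_windows(segments: List[List[str]], window: int, overlap: int = 1) -> List[List[str]]:
    """Pack interleaved segments into samples of at most `window` tokens.

    Assumption: each new sample restarts `overlap` segments before the point
    where the previous sample ended, so neighbouring samples share context.
    """
    samples: List[List[str]] = []
    start = 0
    while start < len(segments):
        end, size = start, 0
        # Greedily add whole segments while they still fit in the window.
        while end < len(segments) and size + len(segments[end]) <= window:
            size += len(segments[end])
            end += 1
        if end == start:
            raise ValueError("a single segment is longer than the window")
        samples.append([tok for seg in segments[start:end] for tok in seg])
        if end == len(segments):
            break
        # Step back by `overlap` segments, but always advance by at least one.
        start = max(end - overlap, start + 1)
    return samples


if __name__ == "__main__":
    segs = interleave("a1 a2 a3 a4", "b1 b2 b3 b4", seg_len=2)
    for sample in pack_windows(segs, window=4, overlap=1):
        print(" ".join(sample))
```

Keeping segments whole and sharing one segment between neighbouring samples is only one way to preserve coherence across chunk boundaries; a real pipeline would count tokens with the target model's tokenizer and might split at sentence or paragraph boundaries instead.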
