CL · Jun 4

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

arXiv:2606.06586 · 37.9
Originality: Incremental advance
AI Analysis

For multilingual LLM developers, this work addresses cross-lingual factual inconsistency and shows that GRPO outperforms supervised fine-tuning (SFT), though the advance over existing approaches is incremental.

The paper introduces PolyFact, a parallel multilingual factual QA dataset, and applies GRPO reinforcement learning to improve cross-lingual factual recall in LLMs. GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data adds little.

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
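For readers unfamiliar with GRPO, the following is a minimal sketch of its central idea: several answers are sampled for the same question, and each answer's reward is normalized against the mean and standard deviation of its own group. The abstract does not specify the paper's reward function or hyperparameters, so the exact-match reward, the function names, and the use of the population standard deviation are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def exact_match_reward(answer: str, gold: str) -> float:
    """Hypothetical reward (an assumption, not taken from the paper):
    1.0 if the answer matches the gold fact after light normalization."""
    return 1.0 if answer.strip().casefold() == gold.strip().casefold() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    Answers better than the group average receive a positive advantage,
    worse ones a negative advantage. `eps` avoids division by zero when
    every sample in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose gold answer is "Paris".
samples = ["Paris", "paris ", "Lyon", "Marseille"]
rewards = [exact_match_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group the two correct answers receive positive advantages and the two incorrect ones negative advantages. If all samples earn the same reward, every advantage is zero and the group contributes no learning signal.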

