CL | AI | Dec 9, 2025

Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

arXiv:2512.08777v1 | h-index: 27
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of building fluent, preference-aligned models for lower-resource languages, which often lack native datasets. The contribution is incremental, since it builds on existing preference-optimization methods.

The paper tackles the problem of aligning language models for lower-resource languages without native instruction-tuning data. It proposes a post-training method that uses on-policy training to preserve fluency. In a case study on Norwegian Bokmål, the method outperforms alternatives such as supervised finetuning on machine-translated data.

We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.

