CL · Feb 3

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

arXiv:2602.03352v1 · 1.11 citations · h-index: 7

Originality: Highly original

AI Analysis

This work addresses a specific bottleneck in RL-based machine translation, namely noisy return estimates and an overly large trajectory space, and offers researchers and practitioners an incremental improvement over existing RL methods.

The paper targets two problems in reinforcement learning for machine translation: noisy learning signals and a large trajectory space. It introduces PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide optimization. The method yields consistent gains over RL baselines across multiple language pairs, and on English→Turkish its COMET-KIWI performance is comparable to advanced LLM-based systems such as DeepSeek-V3.2.

Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
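
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common GRPO-style group normalization (reward minus group mean, divided by group standard deviation). The prompt template, the function names, and the weights `w_mt` and `w_pe` are hypothetical stand-ins, because the abstract does not specify the actual weighting scheme or prompt format.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_post_edit_prompt(source, draft):
    """Hypothetical template: pair the source with a draft sampled in stage one."""
    return f"Source: {source}\nDraft translation: {draft}\nImproved translation:"


def weighted_advantages(mt_rewards, pe_rewards, w_mt=1.0, w_pe=0.5):
    """Scale each task's group-relative advantages by an assumed task weight.

    mt_rewards: rewards of translations sampled in the translation stage.
    pe_rewards: rewards of post-edits sampled for one draft in the second stage.
    """
    mt_adv = [w_mt * a for a in group_relative_advantages(mt_rewards)]
    pe_adv = [w_pe * a for a in group_relative_advantages(pe_rewards)]
    return mt_adv, pe_adv
```

In this sketch, each draft sampled in the translation stage would be passed through `build_post_edit_prompt` to form a post-editing input, so the second-stage rewards are conditioned on what the current policy actually produces. The scaled advantages would then multiply the per-sample log-probability terms in a policy-gradient loss.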
