CL | Jul 18, 2022

MAD for Robust Reinforcement Learning in Machine Translation

DeepMind
arXiv:2207.08583v1 | 8 citations | h-index: 77
Originality: Incremental advance
AI Analysis

This work addresses training instability in reinforcement learning for machine translation. It offers a robust method that improves stability and generalization, though the advance is incremental because it builds on existing policy gradient approaches.

The authors address training instability and poor generalization in reward-aware machine translation by introducing the MAD algorithm. In experiments across a variety of translation tasks, MAD outperformed existing methods such as REINFORCE, MRT, and PPO, and performed well with both greedy decoding and beam search.

We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
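
The abstract names two ingredients without giving formulas: a per-source reward normalization and a statistic called the mean absolute deviation. The sketch below illustrates both in plain Python under stated assumptions. Mean-centering is assumed here as one simple way to normalize rewards per source sentence, and the function names are hypothetical. It is not the paper's exact procedure, and it does not attempt to reproduce the importance weighting scheme itself.

```python
from statistics import mean


def normalize_rewards_per_source(rewards):
    """Center the rewards of the candidates sampled for ONE source sentence.

    Assumption: mean-centering is used as the normalization. Unless all
    candidates score the same, the result contains both positive and
    negative rewards for this source sentence.
    """
    center = mean(rewards)
    return [r - center for r in rewards]


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


# Hypothetical rewards for three sampled translations of one source sentence.
candidate_rewards = [0.2, 0.5, 0.8]
centered = normalize_rewards_per_source(candidate_rewards)
spread = mean_absolute_deviation(candidate_rewards)
```

Normalizing within each source sentence, rather than across a whole batch, means an easy sentence with uniformly high rewards still contributes both encouraging and discouraging examples to the update.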
