Proximal Policy Optimization and its Dynamic Version for Sequence Generation
This work addresses sequence generation optimization for researchers and practitioners, but it is incremental as it adapts an existing reinforcement learning method to a specific domain.
The paper tackles the optimization of sequence generation models by replacing policy gradient with proximal policy optimization (PPO) and proposing a dynamic version (PPO-dynamic). It reports that both methods outperform policy gradient in stability and performance on a synthetic experiment and a chit-chat chatbot task.
In sequence generation tasks, many works use policy gradient for model optimization to work around intractable backpropagation when maximizing non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which has been shown to be a more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks, including a synthetic experiment and a chit-chat chatbot. The results show that PPO and PPO-dynamic outperform policy gradient in both stability and performance.
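
To make the contrast between the two objectives concrete, the following is a minimal Python sketch of a REINFORCE-style policy gradient loss next to the standard clipped surrogate loss of PPO. The function names, the per-token list inputs, and the default clipping value `epsilon=0.2` are illustrative assumptions, not details taken from the paper. The dynamic variant (PPO-dynamic) is not sketched, because its form is not specified in this abstract.

```python
import math


def policy_gradient_loss(logp_new, rewards):
    """REINFORCE-style loss: negative mean of reward-weighted log-probabilities.

    logp_new: log-probabilities of the sampled tokens under the current policy.
    rewards:  scalar reward (or advantage) assigned to each token.
    """
    return -sum(lp * r for lp, r in zip(logp_new, rewards)) / len(rewards)


def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Standard PPO clipped surrogate loss, averaged over tokens.

    logp_old holds the log-probabilities of the same tokens under the policy
    that generated the samples. epsilon=0.2 is an assumed, commonly used
    default, not a value from the paper.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    # One token whose probability doubled (0.4 -> 0.8) with a positive advantage:
    # the clipped term caps the ratio at 1 + epsilon.
    print(ppo_clip_loss([math.log(0.8)], [math.log(0.4)], [1.0]))
```

The clipping removes the incentive to push the probability ratio far from 1 in a single update, which is the usual explanation for PPO being more stable than plain policy gradient.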