Semi-Offline Reinforcement Learning for Optimized Text Generation
This work addresses the efficiency problem in RL for text generation, though the contribution appears incremental because it builds on the existing online and offline settings.
The paper tackles the trade-off between exploration and training cost in reinforcement learning by introducing a semi-offline RL paradigm. In the reported experiments, the approach performs comparably to or better than state-of-the-art methods.
In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at significant time cost, whereas offline methods obtain reward signals efficiently by sacrificing exploration capability. We propose semi-offline RL, a novel paradigm that transitions smoothly from the offline to the online setting, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we present the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and yields performance comparable to, and often better than, that of state-of-the-art methods.
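To make the offline-to-online spectrum concrete, here is a minimal sketch of one way such an interpolation could look. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: the per-token mixing rule, the function name `semi_offline_rollout`, the parameter `p_online`, and the `sample_token` callback standing in for a language model. It should not be read as the paper's actual algorithm.

```python
import random
from typing import Callable, List, Sequence, Tuple


def semi_offline_rollout(
    dataset_tokens: Sequence[str],
    sample_token: Callable[[List[str]], str],
    p_online: float,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Build one trajectory by mixing static and model-sampled tokens.

    Hypothetical illustration only (not the paper's method):
      p_online = 0.0 -> purely offline, the dataset sequence is copied as-is.
      p_online = 1.0 -> purely online, every token is sampled from the model.
    Returns the trajectory and the number of model sampling calls it cost.
    """
    if not 0.0 <= p_online <= 1.0:
        raise ValueError("p_online must lie in [0, 1]")

    trajectory: List[str] = []
    model_calls = 0
    for static_token in dataset_tokens:
        if rng.random() < p_online:
            # Exploration step: ask the (assumed) model for the next token.
            trajectory.append(sample_token(trajectory))
            model_calls += 1
        else:
            # Offline step: reuse the token already stored in the dataset.
            trajectory.append(static_token)
    return trajectory, model_calls


if __name__ == "__main__":
    reference = ["the", "cat", "sat", "down"]
    toy_model = lambda prefix: "<sampled>"  # stand-in for a real language model
    for p in (0.0, 0.5, 1.0):
        tokens, cost = semi_offline_rollout(reference, toy_model, p, random.Random(0))
        print(p, tokens, cost)
```

In this toy setup, the number of sampling calls grows with `p_online`, which mirrors the trade-off the abstract describes: more exploration comes at a higher sampling cost, and the two extremes recover the offline and online settings.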