CL, LG · Jun 5, 2022

Offline RL for Natural Language Generation with Implicit Language Q Learning

Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine

Berkeley

arXiv:2206.11871v2 · 16.5 · 149 citations · h-index: 112

Originality: Incremental advance

AI Analysis

This work addresses the inconsistency of large language models in completing user-specified tasks. It offers a novel offline RL approach that is more effective than prior methods in specific natural language generation settings. The advance is incremental because it builds on existing RL and supervised learning frameworks.

The paper tackles the problem of making large language models more consistent in completing user-specified tasks. It proposes implicit language Q-learning (ILQL), an offline reinforcement learning method that combines RL's utility maximization with supervised learning's stability. The authors report improved effectiveness on end-to-end dialogue and on optimizing high-variance rewards such as toxicity labeling.

Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, along with its simplicity and stability. Our method combines value conservatism with an implicit dataset support constraint in learning value functions, which are then used to guide language model generations towards maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high-variance reward functions based on subjective judgement, such as whether to label a comment as toxic or not.
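
The two ingredients the abstract names, an implicit dataset support constraint and value-guided generation, can be illustrated with a small sketch. It is not the authors' implementation. It assumes the support constraint takes the expectile-regression form used in implicit Q-learning, and that guidance works by shifting the language model's per-token logits by a scaled advantage Q - V. The function names and the hyperparameters tau and beta are hypothetical, and the value-conservatism term is omitted.

```python
import math


def expectile_loss(q_value, v_value, tau=0.7):
    """Asymmetric squared error |tau - 1(u < 0)| * u**2, with u = q - v.

    With tau > 0.5, underestimates of Q by V are penalized more than
    overestimates, so V is pulled toward an upper expectile of Q over
    actions that appear in the dataset (assumed IQL-style constraint).
    """
    u = q_value - v_value
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_token_probs(lm_logits, q_values, v_value, beta=1.0):
    """Shift each token's LM logit by beta * (Q - V), then renormalize.

    beta = 0 recovers the unmodified language model distribution;
    larger beta leans harder on the learned values (assumed scheme).
    """
    shifted = [
        logit + beta * (q - v_value)
        for logit, q in zip(lm_logits, q_values)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    # Two candidate tokens the LM rates equally; the first has higher Q.
    probs = guided_token_probs([0.0, 0.0], [1.0, 0.0], v_value=0.5, beta=1.0)
    print(probs)
```

In this toy call the language model is indifferent between the two tokens, so any preference in the printed distribution comes entirely from the value estimates. Setting beta to 0 makes the two probabilities equal again.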
