CL AI LG

May 19, 2023

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy

arXiv:2305.11455v13.39 citations

Originality: Incremental advance

AI Analysis

This work addresses the inefficiency of traditional RLHF fine-tuning for language models, which trains a separate reward model and then runs policy optimization. The advance remains incremental because the approach has not yet been scaled to practical systems.

The authors approach language model fine-tuning from a novel perspective in which the language model serves simultaneously as policy, reward function, and transition function, eliminating the need for a separate reward model and downstream policy optimization. Their experiments illustrate one benefit of this view, efficient exploration based on epistemic uncertainty, though only on a simple didactic data generating process.

A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence of this is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does indeed break the traditional agent-environment interface, we nevertheless maintain that there can be enormous statistical benefits afforded by bringing to bear traditional algorithmic concepts from reinforcement learning. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. In order to illustrate these ideas in a transparent manner, we restrict attention to a simple didactic data generating process and leave for future work extension to systems of practical scale.
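The abstract refers to "efficient exploration based on the representation and resolution of epistemic uncertainty" but this page does not give the paper's algorithm. As a generic illustration of that idea only, and not the authors' method, the sketch below runs Thompson sampling on a toy Bernoulli bandit: uncertainty about each arm is represented by a Beta posterior, and it is resolved as feedback arrives. The function name `thompson_sampling`, the arm probabilities, and the bandit setting itself are all assumptions introduced for this example.

```python
import random


def thompson_sampling(true_probs, steps, seed=0):
    """Toy Thompson sampling on a Bernoulli bandit (illustrative assumption,
    not the procedure from the paper).

    true_probs: hidden success probability of each arm.
    Returns the pull counts and the Beta posterior parameters per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Beta(1, 1) prior on each arm: this is the representation of
    # epistemic uncertainty about the arm's success probability.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms

    for _ in range(steps):
        # Draw one plausible success probability per arm from its posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled values; arms that are
        # still uncertain get tried because their samples vary widely.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe binary feedback and update the posterior, which
        # resolves some of the uncertainty about the chosen arm.
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls, alpha, beta


if __name__ == "__main__":
    pulls, alpha, beta = thompson_sampling([0.2, 0.8], steps=2000)
    print("pulls per arm:", pulls)
```

In this toy setting the sampler concentrates its pulls on the better arm once the posteriors sharpen, which is the sense in which exploration guided by uncertainty can be statistically efficient compared with exploring uniformly at random.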
