CL | AI | Jan 18, 2024

Self-Rewarding Language Models

arXiv:2401.10020v3 | 611 citations | ICML
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck that human-level feedback places on AI training, which could enable continual improvement in both instruction following and reward quality. The advance is incremental because it builds on existing DPO and prompting methods.

The paper tackles the problem of training superhuman agents by proposing Self-Rewarding Language Models, in which the model provides its own rewards via LLM-as-a-Judge prompting. The resulting fine-tuned Llama 2 70B model outperforms systems such as Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but so does the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
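As a rough illustration of the loop the abstract describes, the sketch below shows one way self-generated rewards could be turned into preference pairs and scored with the standard DPO objective. It is a minimal sketch, not the authors' implementation: `generate` and `judge` are hypothetical callables standing in for sampling from the model and for LLM-as-a-Judge scoring, and the choices of four candidates per prompt, highest-versus-lowest pairing, and `beta=0.1` are assumptions made for illustration.

```python
import math
from typing import Callable, Optional, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the model
    judge: Callable[[str, str], float],  # hypothetical: the same model scoring (prompt, response)
    num_candidates: int = 4,             # assumed value, for illustration only
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from self-scored candidates, or None if all scores tie."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scores = [judge(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # No preference signal: every candidate received the same reward.
        return None
    return candidates[best], candidates[worst]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,                   # assumed temperature, for illustration only
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In an iterative setup, the pairs built with the current model would be used to train the next model, which then acts as both generator and judge in the following round.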

Code Implementations: 3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
