LG Oct 12, 2023

Interpreting Learned Feedback Patterns in Large Language Models

arXiv:2310.08164v65 citations, h-index: 17
Originality: Incremental advance
AI Analysis

This work targets LLM safety by aiming to reduce discrepancies between model behavior and training objectives. It is an incremental advance because it builds on existing RLHF methods.

The paper investigates whether large language models (LLMs) accurately learn human preferences from reinforcement learning from human feedback (RLHF) by analyzing Learned Feedback Patterns (LFPs) in their activations. It finds that probes can estimate feedback signals with accuracy comparable to GPT-4's classifications.

Reinforcement learning from human feedback (RLHF) is widely used to train large language models (LLMs). However, it is unclear whether LLMs accurately learn the underlying preferences in human feedback data. We coin the term Learned Feedback Pattern (LFP) for patterns in an LLM's activations learned during RLHF that improve its performance on the fine-tuning task. We hypothesize that LLMs with LFPs accurately aligned to the fine-tuning feedback exhibit consistent activation patterns for outputs that would have received similar feedback during RLHF. To test this, we train probes to estimate the feedback signal implicit in the activations of a fine-tuned LLM. We then compare these estimates to the true feedback, measuring how accurate the LFPs are to the fine-tuning feedback. Our probes are trained on a condensed, sparse and interpretable representation of LLM activations, making it easier to correlate features of the input with our probe's predictions. We validate our probes by comparing the neural features they correlate with positive feedback inputs against the features GPT-4 describes and classifies as related to LFPs. Understanding LFPs can help minimize discrepancies between LLM behavior and training objectives, which is essential for the safety of LLMs.
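
The probing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear probe fit by ridge regression, and it uses synthetic sparse, non-negative features as a stand-in for the condensed representation of LLM activations. All names here (`fit_linear_probe`, `make_synthetic_data`, the feature count, the sparsity level) are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_linear_probe(features, feedback, l2=1e-2):
    """Fit a ridge-regression probe in closed form; returns (weights, bias)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(feedback, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # center so the bias is not regularized
    gram = Xc.T @ Xc + l2 * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(features, weights, bias):
    """Estimate the feedback signal from a batch of feature vectors."""
    return np.asarray(features, dtype=float) @ weights + bias


def make_synthetic_data(n_samples=400, n_features=32, seed=0):
    """Synthetic stand-in (assumption): sparse features, feedback driven by four of them."""
    rng = np.random.default_rng(seed)
    active = rng.random((n_samples, n_features)) < 0.1  # about 10% of features fire
    features = rng.random((n_samples, n_features)) * active
    true_weights = np.zeros(n_features)
    true_weights[:4] = [1.5, -2.0, 0.75, 1.0]
    noise = 0.05 * rng.standard_normal(n_samples)
    feedback = features @ true_weights + noise
    return features, feedback, true_weights


features, feedback, true_weights = make_synthetic_data()
train_X, test_X = features[:300], features[300:]
train_y, test_y = feedback[:300], feedback[300:]

weights, bias = fit_linear_probe(train_X, train_y)
test_pred = predict(test_X, weights, bias)

# Compare probe estimates to the "true" feedback on held-out samples.
correlation = float(np.corrcoef(test_pred, test_y)[0, 1])

# Features with the largest probe weights are the ones most tied to feedback.
top_features = np.argsort(-np.abs(weights))[:4]
print(f"held-out correlation: {correlation:.3f}")
print("most influential features:", sorted(top_features.tolist()))
```

Because the probe input is sparse, the largest-magnitude weights point to a small set of features, which is what makes it practical to relate the probe's predictions back to interpretable features of the input.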

Code Implementations: 1 repo