CLOct 20, 2023
Towards Understanding Sycophancy in Language ModelsMrinank Sharma, Meg Tong, Tomasz Korbak et al. · openai, stanford
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
AIJul 17, 2023
Measuring Faithfulness in Chain-of-Thought ReasoningTamera Lanham, Anna Chen, Ansh Radhakrishnan et al. · berkeley, stanford
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
ACC-PHSep 10, 2022
Multipoint-BAX: A New Approach for Efficiently Tuning Particle Accelerator Emittance via Virtual ObjectivesSara A. Miskovich, Willie Neiswanger, William Colocho et al.
Although beam emittance is critical for the performance of high-brightness accelerators, optimization is often time limited as emittance calculations, commonly done via quadrupole scans, are typically slow. Such calculations are a type of $\textit{multipoint query}$, i.e. each query requires multiple secondary measurements. Traditional black-box optimizers such as Bayesian optimization are slow and inefficient when dealing with such objectives as they must acquire the full series of measurements, but return only the emittance, with each query. We propose a new information-theoretic algorithm, Multipoint-BAX, for black-box optimization on multipoint queries, which queries and models individual beam-size measurements using techniques from Bayesian Algorithm Execution (BAX). Our method avoids the slow multipoint query on the accelerator by acquiring points through a $\textit{virtual objective}$, i.e. calculating the emittance objective from a fast learned model rather than directly from the accelerator. We use Multipoint-BAX to minimize emittance at the Linac Coherent Light Source (LCLS) and the Facility for Advanced Accelerator Experimental Tests II (FACET-II). In simulation, our method is 20$\times$ faster and more robust to noise compared to existing methods. In live tests, it matched the hand-tuned emittance at FACET-II and achieved a 24% lower emittance than hand-tuning at LCLS. Our method represents a conceptual shift for optimizing multipoint queries, and we anticipate that it can be readily adapted to similar problems in particle accelerators and other scientific instruments.