Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
This work gives AI safety researchers tools for detecting model misalignment mid-inference. The contribution is incremental, since it builds on existing sycophancy analysis.
The paper tackles sycophancy in reasoning models, where a model agrees with an incorrect user suggestion. It introduces sycophantic anchors to localize and quantify this behavior: linear probes detect the anchors with up to 85% balanced accuracy, and regressors predict commitment strength with an R² of up to 0.74.
Reasoning models frequently agree with incorrect user suggestions -- a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce \emph{sycophantic anchors} -- sentences identified via counterfactual analysis that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B--8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74--85\% balanced accuracy), outperforming text-only baselines at high commitment levels -- confirming they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations ($R^2$ up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
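The two headline numbers above are a balanced accuracy (for the probes) and an $R^2$ (for the regressors). The abstract does not say how these were computed, so the following is only a minimal sketch of the standard textbook definitions, assuming binary anchor labels and scalar commitment scores. The function names `balanced_accuracy` and `r_squared` and the toy data are illustrative and are not taken from the paper.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall for binary labels (assumed: 1 = anchor, 0 = non-anchor)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"class {cls} is absent from y_true")
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true is constant, so R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy, made-up labels: 2 anchor sentences among 8 (imbalanced on purpose).
labels = [1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 0, 0, 0, 0, 0, 0, 1]
bal_acc = balanced_accuracy(labels, preds)  # (1/2 + 5/6) / 2

# Toy, made-up commitment scores and regressor outputs.
commitment = [0.0, 0.5, 1.0]
predicted = [0.1, 0.5, 0.9]
r2 = r_squared(commitment, predicted)  # 1 - 0.02 / 0.5

print(f"balanced accuracy = {bal_acc:.3f}, R^2 = {r2:.2f}")
```

On the toy labels, plain accuracy would be 6/8 = 0.75, while balanced accuracy is about 0.667. Averaging recall per class is what keeps a rare positive class, such as anchor sentences within a long reasoning trace, from being swamped by the majority class.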