AI CL GN · Nov 7, 2025

Anchors in the Machine: Behavioral and Attributional Evidence of Anchoring Bias in LLMs

arXiv:2511.05766v1 · 1 citation · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the risk of treating LLMs as human substitutes in applied domains, and makes incremental contributions toward bridging behavioral science, LLM safety, and interpretability.

The paper asks whether anchoring bias in large language models (LLMs) reflects deeper probability shifts or surface imitation. It analyzes behavioral and attributional evidence across six models and finds robust anchoring effects in models such as Gemma-2B and Llama-2-7B, with measurable shifts in their output distributions.

Large language models (LLMs) are increasingly examined as both behavioral subjects and decision systems, yet it remains unclear whether observed cognitive biases reflect surface imitation or deeper probability shifts. Anchoring bias, a classic human judgment bias, offers a critical test case. While prior work shows LLMs exhibit anchoring, most evidence relies on surface-level outputs, leaving internal mechanisms and attributional contributions unexplored. This paper advances the study of anchoring in LLMs through three contributions: (1) a log-probability-based behavioral analysis showing that anchors shift entire output distributions, with controls for training-data contamination; (2) exact Shapley-value attribution over structured prompt fields to quantify anchor influence on model log-probabilities; and (3) a unified Anchoring Bias Sensitivity Score integrating behavioral and attributional evidence across six open-source models. Results reveal robust anchoring effects in Gemma-2B, Phi-2, and Llama-2-7B, with attribution signaling that the anchors influence reweighting. Smaller models such as GPT-2, Falcon-RW-1B, and GPT-Neo-125M show variability, suggesting scale may modulate sensitivity. Attributional effects, however, vary across prompt designs, underscoring fragility in treating LLMs as human substitutes. The findings demonstrate that anchoring bias in LLMs is robust, measurable, and interpretable, while highlighting risks in applied domains. More broadly, the framework bridges behavioral science, LLM safety, and interpretability, offering a reproducible path for evaluating other cognitive biases in LLMs.
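The exact Shapley-value attribution over prompt fields mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the field names (`instruction`, `anchor`, `question`), the helper `exact_shapley`, and the toy scoring function `toy_logprob`, which stands in for a real model's log-probability of a target answer. The paper's actual fields, targets, and scoring are not specified in this summary.

```python
from itertools import combinations
from math import factorial


def exact_shapley(fields, value_fn):
    """Exact Shapley value of each field under value_fn.

    value_fn maps a frozenset of included fields to a float
    (for example, the log-probability of a fixed target answer).
    Cost is exponential in len(fields), so this only suits a
    handful of structured fields.
    """
    n = len(fields)
    phi = {}
    for f in fields:
        others = [g for g in fields if g != f]
        total = 0.0
        for k in range(n):
            # Weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of adding f to coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_logprob(present):
    """Hypothetical stand-in for a model log-probability (not real data)."""
    score = -5.0
    if "instruction" in present:
        score += 0.5
    if "question" in present:
        score += 1.0
    if "anchor" in present:
        score += 2.0
    # Assumed interaction: the anchor matters more when a question is present.
    if "anchor" in present and "question" in present:
        score += 1.0
    return score


FIELDS = ["instruction", "anchor", "question"]
shapley = exact_shapley(FIELDS, toy_logprob)
for name, value in shapley.items():
    print(f"{name}: {value:+.3f}")
```

In a real setting, `value_fn` would rebuild the prompt with only the included fields and query the model for the target's log-probability. The attributions sum to the difference between the full-prompt score and the empty-prompt score, which is a convenient sanity check.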
