CL · LG · Jan 19

Trust Me, I'm an Expert: Decoding and Steering Authority Bias in Large Language Models

arXiv:2601.13433v1
Originality: Incremental advance
AI Analysis

This work addresses authority bias in language-model decision-making, which matters for users who rely on these models in critical domains. It is an incremental advance because it builds on prior work showing that suggestions influence model outputs.

The study investigates whether large language models exhibit authority bias, that is, greater susceptibility to incorrect endorsements from high-expertise sources. Across mathematical, legal, and medical reasoning tasks, it finds that as the endorser's authority increases, models lose more accuracy and become more confident in wrong answers. It also shows that steering the model away from this bias mitigates it, improving performance even when an expert's endorsement is misleading.
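To make the two reported effects concrete, here is a minimal Python sketch of how accuracy degradation and wrong-answer confidence could be tabulated per authority level. It assumes hypothetical per-question records (the `Trial` fields, a 0 to 3 authority scale matching the four expertise levels, and a no-endorsement `baseline_accuracy`); it is not the paper's data format or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    authority: int      # hypothetical scale: 0 = lowest-expertise persona, 3 = highest
    correct: bool       # did the model's final answer match the gold label?
    confidence: float   # model's confidence in its own answer, assumed in [0, 1]


def authority_profile(trials, baseline_accuracy):
    """Per authority level: accuracy, drop vs. a no-endorsement baseline,
    and mean confidence over the answers that were wrong."""
    by_level = defaultdict(list)
    for t in trials:
        by_level[t.authority].append(t)

    profile = {}
    for level in sorted(by_level):
        group = by_level[level]
        acc = sum(t.correct for t in group) / len(group)
        wrong_conf = [t.confidence for t in group if not t.correct]
        profile[level] = {
            "accuracy": acc,
            "accuracy_drop": baseline_accuracy - acc,
            # None when every answer at this level was correct
            "wrong_answer_confidence": mean(wrong_conf) if wrong_conf else None,
        }
    return profile
```

Under the pattern the study reports, higher levels would show a larger `accuracy_drop` and a higher `wrong_answer_confidence`; any numbers fed into this sketch are illustrative only.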

Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of the endorsement source's credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the endorsement's provider. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect or misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and that a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
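The abstract does not say how the steering is performed. One common technique, assumed here purely for illustration, is difference-of-means activation steering: average the hidden states from prompts with high-authority endorsements and from prompts with low-authority endorsements, take the difference as an "authority direction", and subtract a scaled copy of it from the hidden state at inference time. The sketch below uses plain lists in place of a real model's activations; the function names and the scale `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def authority_direction(high_authority_acts: List[Vector],
                        low_authority_acts: List[Vector]) -> Vector:
    """Difference of means: points from low-authority toward high-authority activations."""
    mu_high = mean_vector(high_authority_acts)
    mu_low = mean_vector(low_authority_acts)
    return [h - l for h, l in zip(mu_high, mu_low)]


def steer_away(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Subtract alpha * direction, moving the hidden state away from the high-authority side."""
    return [h - alpha * d for h, d in zip(hidden, direction)]
```

In a real model these vectors would be read from, and written back to, a chosen transformer layer (for example via a forward hook), and `alpha` would need tuning on held-out data so that accuracy recovers without disturbing unrelated behaviour.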

