The Polite Liar: Epistemic Pathology in Language Models
This addresses the problem of epistemic pathology in AI systems for researchers and developers, highlighting a structural issue in alignment methods.
The paper argues that large language models exhibit confident fabrication due to reinforcement learning from human feedback (RLHF), which optimizes for perceived sincerity over evidential accuracy, revealing a tension between linguistic cooperation and epistemic integrity.
Large language models exhibit a peculiar epistemic pathology: they speak as if they know, even when they do not. This paper argues that such confident fabrication, what I call the polite liar, is a structural consequence of reinforcement learning from human feedback (RLHF). Building on Frankfurt's analysis of bullshit as communicative indifference to truth, I show that this pathology is not deception but structural indifference: a reward architecture that optimizes for perceived sincerity over evidential accuracy. Current alignment methods reward models for being helpful, harmless, and polite, but not for being epistemically grounded. As a result, systems learn to maximize user satisfaction rather than truth, performing conversational fluency as a virtue. I analyze this behavior through the lenses of epistemic virtue theory, speech-act philosophy, and cognitive alignment, showing that RLHF produces agents trained to mimic epistemic confidence without access to epistemic justification. The polite liar thus reveals a deeper alignment tension between linguistic cooperation and epistemic integrity. The paper concludes with an "epistemic alignment" principle: reward justified confidence over perceived fluency.