The Narcissus Hypothesis: Descending to the Rung of Illusion
The paper addresses corpus integrity and the reliability of downstream inferences for users of AI models, but its contribution is incremental because it builds on existing concerns about bias in alignment.
The paper tackles the problem of social desirability bias in foundational models, hypothesizing that recursive alignment induces models to favor agreeable responses over objective reasoning, and finds significant drift toward socially conforming traits across 31 models.
Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment (via human feedback and model-generated corpora) induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We call this the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl's Ladder of Causality, culminating in what we refer to as the Rung of Illusion.