LG AI CL CYDec 1, 2025

Do Large Language Models Walk Their Talk? Measuring the Gap Between Implicit Associations, Self-Report, and Behavioral Altruism

arXiv:2512.01568v1

Originality Incremental advance

AI Analysis

This addresses the problem of assessing and aligning LLM behavior for researchers and developers, though it is incremental as it applies existing psychological methods to LLMs.

The study investigated whether Large Language Models (LLMs) exhibit altruistic tendencies and found that while models show strong implicit pro-altruism bias and behave more altruistically than chance, their implicit associations do not predict behavior, and they systematically overestimate their own altruism, with a significant 'virtue signaling gap' affecting 75% of models.

We investigate whether Large Language Models (LLMs) exhibit altruistic tendencies, and critically, whether their implicit associations and self-reports predict actual altruistic behavior. Using a multi-method approach inspired by human social psychology, we tested 24 frontier LLMs across three paradigms: (1) an Implicit Association Test (IAT) measuring implicit altruism bias, (2) a forced binary choice task measuring behavioral altruism, and (3) a self-assessment scale measuring explicit altruism beliefs. Our key findings are: (1) All models show strong implicit pro-altruism bias (mean IAT = 0.87, p < .0001), confirming models "know" altruism is good. (2) Models behave more altruistically than chance (65.6% vs. 50%, p < .0001), but with substantial variation (48-85%). (3) Implicit associations do not predict behavior (r = .22, p = .29). (4) Most critically, models systematically overestimate their own altruism, claiming 77.5% altruism while acting at 65.6% (p < .0001, Cohen's d = 1.08). This "virtue signaling gap" affects 75% of models tested. Based on these findings, we recommend the Calibration Gap (the discrepancy between self-reported and behavioral values) as a standardized alignment metric. Well-calibrated models are more predictable and behaviorally consistent; only 12.5% of models achieve the ideal combination of high prosocial behavior and accurate self-knowledge.

View on arXiv PDF

Similar