CL, May 26

KARMA: Karma-Aligned Reward Model Adaptation

arXiv:2605.26738

AI Analysis

For LLM alignment researchers, KARMA reveals a tension between pragmatic behavior and factuality that is embedded in the reward signal itself.

KARMA adapts reward models using Reddit karma to improve LLM performance on pragmatics-mediated tasks. The reward model that best predicts karma does not yield the best downstream alignment, and KARMA consistently reduces factuality across all conditions.

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest-performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.
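
The abstract does not state the reward model's training objective, so the following is only a hypothetical sketch of what "predicting response valuation conditioned on context" could look like. It assumes a pairwise ranking setup in which sibling replies to the same context are compared by karma. The function names, the data layout, and the Bradley-Terry style loss are illustrative assumptions, not the authors' method.

```python
import math
from itertools import combinations


def karma_preference_pairs(threads):
    """Build (context, preferred, rejected) triples from karma scores.

    `threads` is a hypothetical layout: a list of
    (context, [(reply_text, karma), ...]) tuples.
    """
    pairs = []
    for context, replies in threads:
        for (a, karma_a), (b, karma_b) in combinations(replies, 2):
            if karma_a == karma_b:
                continue  # equal karma carries no preference signal
            preferred, rejected = (a, b) if karma_a > karma_b else (b, a)
            pairs.append((context, preferred, rejected))
    return pairs


def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(s_preferred - s_rejected)).

    Assumed objective for illustration; the scores would come from a
    reward model evaluated on (context, reply).
    """
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))


# Toy data, invented for this example.
threads = [("How do I ask for a raise?",
            [("Start by listing your wins.", 10),
             ("Just demand it.", 2),
             ("Quit instead.", 2)])]
pairs = karma_preference_pairs(threads)
```

In this sketch the loss shrinks as the reward model scores the higher-karma reply above the lower-karma one. A trained scorer of this kind could then serve as the reward in a reinforcement learning fine-tuning loop, which is the role the abstract assigns to the karma-based reward model.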
