Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
This work addresses the need for more nuanced evaluation of AI moral understanding, moving beyond deterministic ground truth to account for human disagreement. The contribution is incremental but important for AI ethics and benchmarking.
The paper tackles the problem of evaluating how well large language models understand moral values compared to humans, using a Bayesian framework to model annotator disagreement. The results show that AI models typically rank among the top 25% of human annotators and produce far fewer false negatives than humans, indicating more sensitive moral detection capabilities.
How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators on 100K+ texts spanning social networks, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI models produce far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
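The quantities named in the abstract (balanced accuracy, false negatives, and a rater's rank among annotators) can be illustrated with a minimal Python sketch. This is not the paper's Bayesian framework, which is not specified here. It assumes binary labels per text (1 = moral value present), and all function names are hypothetical. The `beta_posterior_mean` helper is one simple way to keep annotator disagreement as a soft label instead of collapsing it by majority vote; the uniform Beta(1, 1) prior is an assumption made for illustration only.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels (1 = moral value present)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the true positive rate and the true negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (tpr + tnr)


def false_negative_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of truly positive texts that the rater missed."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return fn / (tp + fn) if (tp + fn) else 0.0


def percentile_rank(score: float, annotator_scores: Sequence[float]) -> float:
    """Fraction of annotators whose score is strictly below `score`."""
    return sum(s < score for s in annotator_scores) / len(annotator_scores)


def beta_posterior_mean(positive_votes: int, total_votes: int) -> float:
    """Posterior mean of the label probability under an assumed Beta(1, 1) prior."""
    return (positive_votes + 1) / (total_votes + 2)


if __name__ == "__main__":
    reference = [1, 1, 0, 0]   # toy reference labels, not real data
    rater = [1, 0, 0, 0]       # toy rater output, not real data
    print(balanced_accuracy(reference, rater))
    print(false_negative_rate(reference, rater))
    print(percentile_rank(0.8, [0.5, 0.6, 0.7, 0.9]))
    print(beta_posterior_mean(3, 4))
```

Under this sketch, a rater in the top 25% of annotators would have a `percentile_rank` of at least 0.75, and a lower `false_negative_rate` corresponds to the more sensitive detection described above.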