CL, AI, CY, HC, SI | Jan 4, 2025

LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena

arXiv:2501.03266v2 | 5 citations | h-index: 2 | Behaviour & Information Technology
Originality: Incremental advance
AI Analysis

The paper addresses the problem of balancing LLM safety against user satisfaction, a concern for both developers and policymakers, and highlights a tension in current moderation strategies.

The study analyzed nearly 50,000 model comparisons from Chatbot Arena to investigate how users respond to LLM refusals. Ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating user dissatisfaction. The penalty varies, however, with prompt sensitivity and refusal phrasing.

LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. In particular, little is known about how users respond when models refuse to answer a prompt, one of the primary mechanisms used to enforce ethical boundaries in LLMs. We address this gap by analyzing nearly 50,000 model comparisons from Chatbot Arena, a platform where users indicate their preferred LLM response in pairwise matchups, providing a large-scale setting for studying real-world user preferences. Using a novel RoBERTa-based refusal classifier fine-tuned on a hand-labeled dataset, we distinguish between refusals due to ethical concerns and technical limitations. Our results reveal a substantial refusal penalty: ethical refusals yield significantly lower win rates than both technical refusals and standard responses, indicating that users are especially dissatisfied when models decline a task for ethical reasons. However, this penalty is not uniform. Refusals receive more favorable evaluations when the underlying prompt is highly sensitive (e.g., involving illegal content), and when the refusal is phrased in a detailed and contextually aligned manner. These findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations, calling for more adaptive moderation strategies that account for context and presentation.
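
To make the win-rate comparison in the abstract concrete, here is a minimal sketch of how per-category win rates could be computed from pairwise matchups. It is not the authors' pipeline. The record layout, the category labels, the half-credit treatment of ties, and the toy data are all assumptions made for illustration, and the response types are taken as already assigned (in the paper they come from the refusal classifier).

```python
from collections import defaultdict

def win_rates(battles):
    """Return the win rate of each response type across pairwise battles.

    Each battle is a tuple (type_a, type_b, winner), where winner is
    "a", "b", or "tie". This layout is hypothetical, not the Chatbot
    Arena schema.
    """
    wins = defaultdict(float)
    appearances = defaultdict(int)
    for type_a, type_b, winner in battles:
        appearances[type_a] += 1
        appearances[type_b] += 1
        if winner == "a":
            wins[type_a] += 1.0
        elif winner == "b":
            wins[type_b] += 1.0
        else:
            # Assumption: a tie counts as half a win for each side.
            wins[type_a] += 0.5
            wins[type_b] += 0.5
    return {t: wins[t] / appearances[t] for t in appearances}

# Made-up toy data, only to show the shape of the computation.
toy_battles = [
    ("ethical_refusal", "standard", "b"),
    ("standard", "ethical_refusal", "a"),
    ("technical_refusal", "standard", "tie"),
    ("ethical_refusal", "technical_refusal", "b"),
]

rates = win_rates(toy_battles)
for response_type, rate in sorted(rates.items()):
    print(f"{response_type}: {rate:.2f}")
```

A raw comparison like this ignores confounders such as model identity and prompt sensitivity, which is why the abstract's finding that the penalty varies with context matters when reading any single win-rate figure.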

