When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
For human-AI collaborative moderation systems, this work provides a preliminary method to identify when human judgment is needed based on the structure of agent disagreement, rather than treating all disagreement as noise.
The paper proposes that disagreement among LLM-based agents can signal genuine value pluralism in hate speech moderation rather than noise. Using a four-category taxonomy of disagreement patterns, the authors find that cases where agents agree on a verdict show markedly lower human annotator disagreement (Cohen's d > 0.8) than cases where agents disagree.
When LLM-based multi-agent systems disagree, current practice treats this as noise to be resolved through consensus. We propose it can be signal. We focus on hate speech moderation, a domain where judgments depend on cultural context and individual value weightings, producing high legitimate disagreement among human annotators. We hypothesize that convergent disagreement, where agents reason similarly but conclude differently, indicates genuine value pluralism that humans also struggle to resolve. Using the Measuring Hate Speech corpus, we embed reasoning traces from five perspective-differentiated agents and classify disagreement patterns using a four-category taxonomy based on reasoning similarity and conclusion agreement. We find that raw reasoning divergence weakly predicts human annotator conflict, but the structure of agent discord carries additional signal: cases where agents agree on a verdict show markedly lower human disagreement than cases where they do not, with large effect sizes (d > 0.8) surviving correction for multiple comparisons. Our taxonomy-based ordering correlates with human disagreement patterns. These preliminary findings motivate a shift from consensus-seeking to uncertainty-surfacing multi-agent design, where disagreement structure, not magnitude, guides when human judgment is needed.
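The four-category taxonomy crosses two binary axes: whether the agents' reasoning traces are similar, and whether their verdicts agree. The following is a minimal sketch of that idea, not the authors' implementation. It assumes reasoning similarity is measured as mean pairwise cosine similarity over trace embeddings, that "agreement" means a unanimous verdict, and that a fixed threshold (`sim_threshold=0.8`, chosen arbitrarily here) separates similar from dissimilar reasoning. Only the label "convergent disagreement" comes from the abstract; the other three category names and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all agent pairs (needs at least two agents)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def classify_disagreement(embeddings, verdicts, sim_threshold=0.8):
    """Place one item into a 2x2 taxonomy of reasoning similarity x verdict agreement.

    sim_threshold is an illustrative assumption, not a value from the paper.
    """
    similar = mean_pairwise_similarity(embeddings) >= sim_threshold
    unanimous = len(set(verdicts)) == 1
    if similar and unanimous:
        return "convergent agreement"      # hypothetical label
    if similar:
        return "convergent disagreement"   # term used in the abstract
    if unanimous:
        return "divergent agreement"       # hypothetical label
    return "divergent disagreement"        # hypothetical label


# Toy example: five agents with near-identical reasoning but split verdicts.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [1.0, 0.05], [1.0, 0.0]]
toy_verdicts = ["hate", "not_hate", "hate", "hate", "not_hate"]
toy_category = classify_disagreement(toy_embeddings, toy_verdicts)
```

Under these assumptions, an item falling into the "convergent disagreement" cell would be a candidate for routing to a human moderator rather than being forced to consensus. In practice the embeddings would come from a sentence-embedding model applied to each agent's reasoning trace, and the threshold would need to be tuned on held-out data.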