CL | AI | HC | Dec 29, 2025

Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

arXiv:2512.23835v1 | h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for interpretability in bias detection systems for journalism, though it is incremental, since it compares existing methods on an established dataset.

The study examines how transformer-based models detect bias in news text by comparing two models with SHAP explanations. It finds that the domain-adaptive model produces 63% fewer false positives and that false positives stem from discourse-level ambiguity rather than explicit bias cues.

Automated bias detection in news text is widely used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. Using SHAP-based explanations, we present a comparative interpretability study of two transformer-based bias detection models, both fine-tuned on the BABE dataset: a bias detector model and a domain-adapted pre-trained RoBERTa model. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how they integrate these signals into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives. This misalignment between attribution strength and prediction correctness contributes to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.
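To make "comparing attributions across correct and incorrect predictions" concrete, here is a minimal Python sketch. It assumes per-token SHAP values for the "biased" class have already been computed (for example with the `shap` library) and stored in a hypothetical record format with `label`, `pred`, and `token_shap` keys. The aggregation used here sums each example's positive attributions and averages them per outcome category (TP, FP, FN, TN). This choice is an assumption and not necessarily the authors' procedure. The toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def outcome(label: int, pred: int) -> str:
    """Map gold label and prediction (1 = biased, 0 = neutral) to TP/FP/FN/TN."""
    if pred == 1:
        return "TP" if label == 1 else "FP"
    return "FN" if label == 1 else "TN"


def attribution_by_outcome(examples):
    """Mean positive SHAP evidence toward 'biased', grouped by outcome category.

    Assumed (hypothetical) record format:
      {"label": int, "pred": int, "token_shap": [(token, shap_value), ...]}
    """
    buckets = defaultdict(list)
    for ex in examples:
        # Keep only contributions that push the prediction toward 'biased'.
        evidence = sum(v for _, v in ex["token_shap"] if v > 0)
        buckets[outcome(ex["label"], ex["pred"])].append(evidence)
    return {name: mean(vals) for name, vals in buckets.items()}


def top_tokens(examples, outcome_name: str, k: int = 5):
    """Rank tokens by summed positive SHAP value within one outcome category."""
    totals = defaultdict(float)
    for ex in examples:
        if outcome(ex["label"], ex["pred"]) != outcome_name:
            continue
        for tok, v in ex["token_shap"]:
            if v > 0:
                totals[tok.lower()] += v
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]


def relative_fp_reduction(fp_baseline: int, fp_model: int) -> float:
    """Fraction by which fp_model reduces false positives relative to fp_baseline."""
    if fp_baseline <= 0:
        raise ValueError("fp_baseline must be positive")
    return (fp_baseline - fp_model) / fp_baseline


# Hypothetical toy records, not data from the paper.
EXAMPLES = [
    {"label": 1, "pred": 1, "token_shap": [("slammed", 0.4), ("the", -0.01), ("bill", 0.05)]},
    {"label": 0, "pred": 1, "token_shap": [("critics", 0.3), ("said", 0.35)]},
    {"label": 0, "pred": 0, "token_shap": [("the", 0.02), ("vote", -0.2)]},
]

if __name__ == "__main__":
    print(attribution_by_outcome(EXAMPLES))
    print(top_tokens(EXAMPLES, "FP"))
    print(relative_fp_reduction(100, 37))
```

Under this assumed aggregation, a mean FP evidence larger than the mean TP evidence is the kind of misalignment the abstract reports for the bias detector model. `top_tokens(..., "FP")` lists the words that drive false positives, which can then be inspected for discourse-level ambiguity. The counts passed to `relative_fp_reduction` are hypothetical and only show how a "63% fewer false positives" figure is computed.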
