ViLBias: Detecting and Reasoning about Bias in Multimodal Content
This work addresses bias detection in multimodal news content and offers a scalable benchmark with baselines. Its contribution is incremental, however, since it builds on existing VQA-style methods and tuning strategies.
The authors introduce ViLBias, a benchmark and framework for detecting and reasoning about bias in text-image news pairs. They report that adding images to text improves detection accuracy by 3-5%, and that parameter-efficient tuning methods recover 97-99% of full fine-tuning performance with less than 5% trainable parameters.
Detecting bias in multimodal news requires models that reason over text--image pairs, not just classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text--image pairs from diverse outlets, each annotated with a bias label and concise rationale using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision--Language Models (VLMs) across closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that incorporating images alongside text improves detection accuracy by 3--5\%, and that LLMs/VLMs better capture subtle framing and text--image inconsistencies than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97--99\% of full fine-tuning performance with $<5\%$ trainable parameters. For oVQA, reasoning accuracy spans 52--79\% and faithfulness 68--89\%, both improved by instruction tuning; closed accuracy correlates strongly with reasoning ($r = 0.91$). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.
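To make the "$<5\%$ trainable parameters" figure concrete, the following is a minimal sketch of the standard LoRA parameter count: a rank-r adapter on a weight matrix of shape (d_out, d_in) adds r * (d_in + d_out) trainable weights while the base weights stay frozen. The model size, layer count, hidden size, and rank below are hypothetical values chosen for illustration; the abstract does not state the configurations actually used.

```python
def lora_param_count(d_in: int, d_out: int, rank: int, n_adapted: int) -> int:
    """Trainable weights added by rank-`rank` LoRA adapters on
    `n_adapted` weight matrices of shape (d_out, d_in)."""
    # Each adapter is a pair of low-rank factors: A (rank x d_in) and B (d_out x rank).
    return n_adapted * rank * (d_in + d_out)


def trainable_fraction(lora_params: int, base_params: int) -> float:
    """Share of all parameters that are trainable when the base model is frozen."""
    return lora_params / (base_params + lora_params)


# Hypothetical setup (not from the paper): a 7B-parameter model with
# 32 layers, 4 adapted attention projections per layer, hidden size 4096, rank 16.
lora_params = lora_param_count(d_in=4096, d_out=4096, rank=16, n_adapted=32 * 4)
frac = trainable_fraction(lora_params, base_params=7_000_000_000)
print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable fraction: {frac:.4%}")
```

Under these assumed values the trainable share is a small fraction of one percent, comfortably inside the $<5\%$ budget reported above. Adapter placement and rank change the count, so the figure is illustrative only.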