CV, AI, LG | Nov 24, 2025

CLASH: A Benchmark for Cross-Modal Contradiction Detection

arXiv:2511.19199v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable cross-modal contradiction detection in AI systems. It is an incremental advance because it builds on existing multimodal benchmarks.

The authors tackled the problem of detecting contradictions between images and text, a key challenge for preventing AI hallucinations, by introducing CLASH, a benchmark with COCO images paired with contradictory captions. Their analysis showed that state-of-the-art models have significant limitations in this task, but fine-tuning on CLASH substantially improved performance.

Contradictory multimodal inputs are common in real-world settings, yet existing benchmarks typically assume input consistency and fail to evaluate cross-modal contradiction detection - a fundamental capability for preventing hallucinations and ensuring reliability. We introduce CLASH, a novel benchmark for multimodal contradiction detection, featuring COCO images paired with contradictory captions containing controlled object-level or attribute-level contradictions. The samples include targeted questions evaluated in both multiple-choice and open-ended formats. The benchmark provides an extensive fine-tuning set filtered through automated quality checks, alongside a smaller human-verified diagnostic set. Our analysis of state-of-the-art models reveals substantial limitations in recognizing cross-modal conflicts, exposing systematic modality biases and category-specific weaknesses. Furthermore, we empirically demonstrate that targeted fine-tuning on CLASH substantially enhances conflict detection capabilities.
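To make the benchmark's structure concrete, here is a minimal sketch of what one multiple-choice sample and a per-category accuracy score might look like. It assumes a flat record per sample; the class name `ContradictionSample`, every field name, and the function `accuracy_by_type` are hypothetical illustrations and are not taken from the CLASH release, whose actual schema and scoring may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContradictionSample:
    # Hypothetical schema: field names are illustrative, not from the paper.
    image_id: str             # identifier of the COCO image
    caption: str              # caption that contradicts the image
    contradiction_type: str   # "object" or "attribute"
    question: str             # targeted question about the conflict
    choices: tuple[str, ...]  # multiple-choice options
    answer_index: int         # index of the correct option in `choices`


def accuracy_by_type(samples, predictions):
    """Return multiple-choice accuracy for each contradiction type.

    `predictions` holds one predicted choice index per sample, in order.
    """
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        total[sample.contradiction_type] += 1
        if predicted == sample.answer_index:
            correct[sample.contradiction_type] += 1

    # Every key in `total` has at least one sample, so division is safe.
    return {kind: correct[kind] / total[kind] for kind in total}
```

Splitting accuracy by contradiction type mirrors the abstract's point that weaknesses are category-specific; a single aggregate score would hide that.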

