CV | IV | Oct 24, 2025

Caption-Driven Explainability: Probing CNNs for Bias via CLIP

arXiv:2510.22035v4 | h-index: 81 | Has Code | 2025 IEEE International Conference on Image Processing Workshops (ICIPW)
Originality: Incremental advance
AI Analysis

This paper addresses robustness issues in ML models for computer vision, but the advance is incremental because it builds on existing XAI and CLIP methods.

The paper tackles the problem of misleading saliency maps in explainable AI for computer vision. It proposes a caption-based method that integrates the model to be explained into CLIP to identify the dominant concept behind its predictions, which reduces the risk of the model falling for a covariate shift and improves robustness.

Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai
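
To make the idea of a "dominant concept" concrete, the sketch below shows only the generic CLIP-style scoring step: each candidate caption is compared with an image embedding by cosine similarity, and the similarities are turned into probabilities with a softmax. It is not the authors' network surgery and is not taken from their repository. The function names, the toy three-dimensional embeddings, the two captions, and the temperature value are all assumptions chosen for illustration; in a real setup the embeddings would come from the image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dominant_concept(image_embedding, caption_embeddings, temperature=0.01):
    """Rank captions by similarity to the image embedding.

    Returns the best-matching caption and a dict of caption -> probability.
    `temperature` is an assumed scaling factor, not a value from the paper.
    """
    captions = list(caption_embeddings)
    logits = [
        cosine(image_embedding, caption_embeddings[c]) / temperature
        for c in captions
    ]
    probs = softmax(logits)
    ranked = sorted(zip(captions, probs), key=lambda cp: cp[1], reverse=True)
    return ranked[0][0], dict(ranked)


# Hypothetical toy embeddings: the image vector lies closer to the
# color caption than to the shape caption.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a red digit": [1.0, 0.0, 0.1],
    "a photo of the digit five": [0.1, 1.0, 0.0],
}

best, probs = dominant_concept(image_embedding, caption_embeddings)
print(best)
```

If a caption naming a spurious attribute (here, color) outranks the caption naming the intended attribute (here, shape), that is the kind of signal a caption-based explanation is meant to surface, even when both attributes occupy the same pixels and a saliency map cannot separate them.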
