IV · AI · LG · May 8

CAMAL: Improving Attention Alignment and Faithfulness with Segmentation Masks

arXiv:2605.08325 · 35.1
Predicted impact: top 51% in IV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners building interpretable vision models, CAMAL provides a scalable method to enforce spatially accurate and causally meaningful attention using widely available segmentation masks.

CAMAL uses segmentation masks as an auxiliary regularizer to improve both attention alignment and faithfulness in vision models, achieving over 35% improvement in attention faithfulness compared to recent work while maintaining or improving generalization without extra inference cost.

Many vision datasets now provide segmentation masks in addition to annotated images to support a wide range of tasks. In this work, we propose Class Activation Map Attention Learning (CAMAL), an efficient and scalable method that utilizes segmentation masks to improve attention alignment and faithfulness in vision models. Specifically, attention alignment refers to the degree to which a model's attention aligns with ground-truth discriminative regions, while attention faithfulness refers to the degree to which a model's attention influences its decision. Improving both attention alignment and faithfulness is essential for ensuring that model attention is both spatially accurate and causally meaningful. To improve attention alignment and faithfulness in vision models, CAMAL first extracts the model's attention for each image during training and then compares the attention to ground-truth discriminative regions obtained from the corresponding segmentation masks. CAMAL then acts as an auxiliary regularizer, encouraging attention that aligns with ground-truth discriminative regions, while suppressing attention elsewhere. We evaluated CAMAL across two learning paradigms -- Deep Learning (DL) and Deep Reinforcement Learning (DRL) -- and observed consistent, significant improvements in both attention alignment and faithfulness. In particular, CAMAL yields statistically significant gains in attention alignment across all settings, and improves attention faithfulness by over 35% compared to recent work. Moreover, we show that improved attention alignment and faithfulness enhance explainability, while yielding improved or comparable generalization performance without increasing inference cost. These findings demonstrate that the spatial information contained within segmentation masks can be effectively leveraged to guide model attention across learning tasks.
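The training-time mechanism described in the abstract (extract the model's attention, compare it to the segmentation mask, and add the mismatch to the loss as a regularizer) can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: the attention map is computed as a classic class activation map (a class-weighted sum of feature maps), the penalty is a mean squared difference against a binary mask, and the names `class_activation_map`, `attention_alignment_penalty`, `total_loss`, and `reg_weight` are hypothetical rather than taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # an H x W map stored as nested lists


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Class-weighted sum of K feature maps, passed through ReLU and scaled to [0, 1]."""
    h, w = len(features[0]), len(features[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(features, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * fmap[i][j]
    # Keep only positive evidence, then normalize by the peak value.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def attention_alignment_penalty(cam: Grid, mask: Grid) -> float:
    """Assumed penalty: mean squared difference between attention and a binary mask.

    It is small when attention is high inside the mask and low elsewhere.
    """
    h, w = len(cam), len(cam[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            total += (cam[i][j] - mask[i][j]) ** 2
    return total / (h * w)


def total_loss(task_loss: float, cam: Grid, mask: Grid, reg_weight: float = 0.1) -> float:
    """Task loss plus the weighted attention penalty (auxiliary regularizer)."""
    return task_loss + reg_weight * attention_alignment_penalty(cam, mask)


# Toy example: two 2x2 feature maps and their class weights.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
class_weights = [2.0, 1.0]
cam = class_activation_map(features, class_weights)

mask_aligned = [[1.0, 0.0], [0.0, 0.0]]     # object where attention peaks
mask_misaligned = [[0.0, 1.0], [0.0, 0.0]]  # object somewhere else

penalty_aligned = attention_alignment_penalty(cam, mask_aligned)
penalty_misaligned = attention_alignment_penalty(cam, mask_misaligned)
loss = total_loss(1.0, cam, mask_aligned, reg_weight=0.1)
```

In this toy case the penalty is lower when the mask coincides with the attention peak than when it does not, which is the behavior an alignment regularizer is meant to reward. Because the penalty enters only the training loss, a sketch of this kind leaves the forward pass unchanged at inference time, in line with the abstract's claim of no added inference cost.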

