CV · Apr 23, 2024

CA-Stream: Attention-based pooling for interpretable image recognition

arXiv:2404.14996v1 · 3 citations · h-index: 43 · XAI4CV
Originality: Incremental advance
AI Analysis

This work addresses the interpretability of image recognition models, but it is incremental as it builds on existing transformer-based architectures.

The authors tackled the problem of improving interpretability in image recognition models by designing an attention-based pooling mechanism called CA-Stream to replace Global Average Pooling, resulting in enhanced interpretability while preserving recognition performance.

Explanations obtained from transformer-based architectures in the form of raw attention can be seen as a class-agnostic saliency map. Additionally, attention-based pooling serves as a form of masking in feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross-attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models, while preserving recognition performance.
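To make the contrast between GAP and attention-based pooling concrete, here is a minimal pure-Python sketch of a single pooling step. It is an illustration under stated assumptions, not the CA-Stream architecture itself: it uses one query vector and one feature map, with no learned projections and no stream of blocks across depths. The names `global_average_pool`, `attention_pool`, `query`, and `features` are hypothetical and do not come from the paper.

```python
import math

def global_average_pool(features):
    """Average p feature vectors (each of length d) into a single vector."""
    p = len(features)
    d = len(features[0])
    return [sum(x[j] for x in features) / p for j in range(d)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(query, features):
    """Pool features with attention weights computed from one query vector.

    Assumption: scaled dot-product similarity between the query and each
    spatial feature, with no learned key/value projections.
    Returns the pooled vector and the attention weights; the weights can be
    read as a class-agnostic saliency map over spatial positions.
    """
    d = len(query)
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(d)
        for feat in features
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[j] for w, feat in zip(weights, features))
        for j in range(d)
    ]
    return pooled, weights

# Toy feature map: 4 spatial positions, 2 channels (made-up values).
features = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [0.0, 0.0]]
query = [1.0, 0.0]

pooled, saliency = attention_pool(query, features)
gap = global_average_pool(features)

# A zero query gives uniform weights, so attention pooling reduces to GAP.
uniform_pooled, uniform_weights = attention_pool([0.0, 0.0], features)
```

In this sketch the attention weights act as a soft mask over spatial positions: positions that align with the query contribute more to the pooled vector, whereas GAP weights every position equally.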

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
