CV | AI | LG | Apr 13, 2024

Understanding Multimodal Deep Neural Networks: A Concept Selection View

arXiv:2404.08964v1 | 10 citations | h-index: 8 | CogSci
Originality: Incremental advance
AI Analysis

This work addresses the need for transparency in complex AI models, which matters to both researchers and practitioners. The advance is incremental because it builds on existing concept-based methods.

The paper tackles the problem of interpreting multimodal deep neural networks such as CLIP. It proposes a two-stage Concept Selection Model (CSM) that automatically mines core concepts without human priors. The model achieves performance comparable to black-box models, and human evaluation finds the selected concepts interpretable.

Multimodal deep neural networks, represented by CLIP, have generated rich downstream applications owing to their excellent performance, making the decision-making process of CLIP an essential research topic. Because of its complex structure and massive pre-training data, CLIP is often regarded as a black-box model that is too difficult to understand and interpret. Concept-based models map the black-box visual representations extracted by deep neural networks onto a set of human-understandable concepts and use those concepts to make predictions, which makes the decision-making process more transparent. However, these methods rely on datasets labeled with fine-grained attributes by experts, which incurs high costs and introduces excessive human prior knowledge and bias. In this paper, we observe that concepts follow a long-tail distribution, and on that basis we propose a two-stage Concept Selection Model (CSM) to mine core concepts without introducing any human priors. A concept greedy rough selection algorithm first extracts the head concepts, and a concept mask fine selection method then extracts the core concepts from them. Experiments show that our approach achieves performance comparable to end-to-end black-box models, and human evaluation demonstrates that the concepts discovered by our method are interpretable and comprehensible to humans.
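The two-stage idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not give the selection criteria, so the sketch assumes a nearest-class-centroid classifier as the scoring function, plain forward selection for the rough stage, and leave-one-out pruning with a binary mask as a stand-in for the fine stage. The concept names, embeddings, labels, and all function names are hypothetical, and a raw dot product replaces the cosine similarity over normalized embeddings that CLIP uses.

```python
# Toy sketch of a two-stage concept selection pipeline (assumptions throughout).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def concept_activations(image_emb, concept_embs):
    """Map one image embedding to one activation score per concept."""
    return [dot(image_emb, c) for c in concept_embs]

def accuracy(acts, labels, keep):
    """Nearest-class-centroid accuracy using only the concept indices in `keep`."""
    rows = [[a[j] for j in keep] for a in acts]
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    def sq_dist(r, m):
        return sum((x - y) ** 2 for x, y in zip(r, m))

    correct = sum(
        min(classes, key=lambda c: sq_dist(r, centroids[c])) == y
        for r, y in zip(rows, labels)
    )
    return correct / len(labels)

def greedy_rough_selection(acts, labels, budget):
    """Stage 1 (assumed form): forward selection of `budget` head concepts."""
    selected = []
    remaining = list(range(len(acts[0])))
    while remaining and len(selected) < budget:
        # Add the concept that gives the best accuracy with those already chosen.
        best = max(remaining, key=lambda j: accuracy(acts, labels, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

def mask_fine_selection(acts, labels, head):
    """Stage 2 (assumed form): mask out head concepts that are not needed."""
    mask = {j: True for j in head}
    base = accuracy(acts, labels, head)
    for j in reversed(head):  # try the most recently added concepts first
        trial = [k for k in head if mask[k] and k != j]
        if trial and accuracy(acts, labels, trial) >= base:
            mask[j] = False
    return [j for j in head if mask[j]]

# Hypothetical 3-d "embeddings"; real CLIP embeddings have hundreds of dimensions.
concept_names = ["striped", "has wings", "metallic", "outdoors"]
concept_embs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
image_embs = [[9, 1, 1], [8, 2, 0], [1, 9, 3], [3, 8, 1], [0, 1, 9], [1, 3, 8]]
labels = [0, 0, 1, 1, 2, 2]  # e.g. zebra, bird, car

acts = [concept_activations(e, concept_embs) for e in image_embs]
head = greedy_rough_selection(acts, labels, budget=3)
core = mask_fine_selection(acts, labels, head)
core_names = [concept_names[j] for j in core]
print("head concepts:", [concept_names[j] for j in head])
print("core concepts:", core_names)
```

On this toy data the rough stage keeps three head concepts and the fine stage masks one of them out without losing accuracy, which mirrors the head-then-core structure the abstract describes. The uninformative "outdoors" concept is never selected.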

