ML · AI · CV · LG · Apr 6, 2021

Robust Semantic Interpretability: Revisiting Concept Activation Vectors

arXiv:2104.02768v1 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better debugging tools and for ways to identify inductive biases in machine learning models. Its contribution is incremental: it generalizes and improves upon existing concept activation vectors.

The authors tackle semantic interpretability in image classification by proposing Robust Concept Activation Vectors (RCAV), a method that quantifies the effects of semantic concepts on model predictions and behavior. They report that it yields more accurate and more robust interpretations than previous methods.

Interpretability methods for image classification assess model trustworthiness by attempting to expose whether the model is systematically biased or attending to the same cues as a human would. Saliency methods for feature attribution dominate the interpretability literature, but these methods do not address semantic concepts such as the textures, colors, or genders of objects within an image. Our proposed Robust Concept Activation Vectors (RCAV) quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole. RCAV calculates a concept gradient and takes a gradient ascent step to assess model sensitivity to the given concept. By generalizing previous work on concept activation vectors to account for model non-linearity, and by introducing stricter hypothesis testing, we show that RCAV yields interpretations which are both more accurate at the image level and robust at the dataset level. RCAV, like saliency methods, supports the interpretation of individual predictions. To evaluate the practical use of interpretability methods as debugging tools, and the scientific use of interpretability methods for identifying inductive biases (e.g. texture over shape), we construct two datasets and accompanying metrics for realistic benchmarking of semantic interpretability methods. Our benchmarks expose the importance of counterfactual augmentation and negative controls for quantifying the practical usability of interpretability methods.
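The core step described in the abstract (compute a concept direction, take one ascent step in activation space, and measure how the prediction changes) can be illustrated with a minimal NumPy sketch. This is one reading of the abstract, not the authors' implementation: the logistic-regression probe, the synthetic activations, the `toy_head` function standing in for the upper layers of a network, and all parameter names are assumptions made for illustration. The paper's hypothesis testing is not shown.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def fit_concept_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe: concept (1) vs. non-concept (0) activations."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))   # predicted concept probability
        w -= lr * (acts.T @ (p - labels)) / n       # gradient of the mean log-loss
        b -= lr * np.mean(p - labels)
    return w, b

def concept_sensitivity(head, act, w, class_idx, step_size=1.0):
    """Change in one class probability after a single step along the unit concept direction."""
    direction = w / np.linalg.norm(w)
    return head(act + step_size * direction)[class_idx] - head(act)[class_idx]

def toy_head(act):
    # Hypothetical stand-in for the (non-linear) layers above the probed layer.
    W = np.eye(3, 4)
    W[0, 0] = 2.0                          # class 0 depends strongly on activation unit 0
    return softmax(W @ np.tanh(act))

# Synthetic activations: the "concept" shifts unit 0 upward (an assumption for this demo).
rng = np.random.default_rng(0)
shift = np.array([2.0, 0.0, 0.0, 0.0])
acts = np.vstack([rng.normal(size=(200, 4)) + shift,
                  rng.normal(size=(200, 4)) - shift])
labels = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_concept_probe(acts, labels)
delta = concept_sensitivity(toy_head, np.zeros(4), w, class_idx=0)
print("concept direction:", w / np.linalg.norm(w))
print("change in P(class 0):", delta)
```

Because the perturbed activation is passed through the non-linear head rather than approximated with a first-order directional derivative, the score reflects the model's actual response to a finite step, which is the kind of non-linearity the abstract says RCAV accounts for. In a real setting, `toy_head` would be replaced by the trained network's layers above the probed layer.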

Code Implementations: 1 repo
