LG · ML · Jan 24, 2024

Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

arXiv:2401.13544v3 · 34 citations · NIPS
Originality: Incremental advance
AI Analysis

This work addresses the need for interpretability and user control in machine learning models, particularly in high-stakes domains like medical imaging, by providing a method to intervene on black-box models without requiring interpretable design.

The paper tackles the problem of making pretrained black-box neural networks intervenable through concept-based interventions. It demonstrates that fine-tuning improves intervention effectiveness and often improves prediction calibration on synthetic and real-world benchmarks, including chest X-ray classifiers, where fine-tuned black boxes are more intervenable than concept bottleneck models.

Recently, interpretable machine learning has re-explored concept bottleneck models (CBMs). An advantage of this model class is the user's ability to intervene on predicted concept values, affecting the downstream output. In this work, we introduce a method to perform such concept-based interventions on pretrained neural networks, which are not interpretable by design, given only a small validation set with concept labels. Furthermore, we formalise the notion of intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black boxes. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We focus on backbone architectures of varying complexity, from simple, fully connected neural nets to Stable Diffusion. We demonstrate that the proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of our techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes are more intervenable than CBMs. Lastly, we establish that our methods are still effective under vision-language-model-based concept annotations, alleviating the need for a human-annotated validation set.
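
The abstract does not spell out the intervention procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a probe that predicts a concept from a hidden representation, edits that representation so the probe agrees with a user-supplied concept value while staying close to the original, and measures effectiveness as the drop in target loss. All names, weights, and hyperparameters (`W_HEAD`, `W_PROBE`, `intervene`, `lam`, `lr`, `steps`) are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumption: fixed toy weights stand in for a pretrained black-box head
# and for a probe that would normally be fitted on a small concept-labelled
# validation set.
W_HEAD, B_HEAD = [1.5, 0.5], 0.0    # head: representation z -> P(y = 1)
W_PROBE, B_PROBE = [2.0, 0.0], 0.0  # probe: representation z -> P(concept = 1)

def head(z):
    return sigmoid(dot(W_HEAD, z) + B_HEAD)

def probe(z):
    return sigmoid(dot(W_PROBE, z) + B_PROBE)

def bce(p, y):
    """Binary cross-entropy of probability p against label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def intervene(z, target_concept, lam=5.0, lr=0.05, steps=200):
    """Gradient descent on lam * BCE(probe(z'), target) + ||z' - z||^2."""
    z_new = list(z)
    for _ in range(steps):
        residual = probe(z_new) - target_concept  # dBCE/dlogit for a sigmoid probe
        grad = [lam * residual * w + 2.0 * (zn - zo)
                for w, zn, zo in zip(W_PROBE, z_new, z)]
        z_new = [zn - lr * g for zn, g in zip(z_new, grad)]
    return z_new

# One toy example: the probe initially predicts the wrong concept value.
z = [-1.0, 0.2]
y_true, c_true = 1, 1

z_edit = intervene(z, c_true)
loss_before = bce(head(z), y_true)
loss_after = bce(head(z_edit), y_true)
gain = loss_before - loss_after  # positive when the intervention helps

print(f"probe before/after: {probe(z):.3f} / {probe(z_edit):.3f}")
print(f"target loss before/after: {loss_before:.3f} / {loss_after:.3f}")
```

In this toy setup the gain is positive because the head and the probe rely on overlapping directions of the representation. Averaging such a gain over a dataset is one way to read "effectiveness of concept-based interventions"; the paper's formal definition of intervenability may differ.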

Code Implementations: 1 repo
