Unlocking Text Capabilities in Vision Models
This work addresses the interpretability challenge in vision models for researchers and practitioners, offering a data-efficient method to enhance model transparency without compromising performance.
The authors tackled the problem of making visual classifiers interpretable by enabling them to be queried with free-form text, achieving new state-of-the-art results in zero-shot concept bottleneck models and feature decoding while using up to 400x fewer images and 400,000x less text during training.
Visual classifiers provide high-dimensional feature representations that are challenging to interpret and analyze. Text, in contrast, is a more expressive and human-friendly medium for understanding and analyzing model behavior. We propose a simple yet powerful method for reformulating any pretrained visual classifier so that it can be queried with free-form text without compromising its original performance. Our approach is label-free, data- and compute-efficient, and is trained to preserve the underlying classifier's distribution and decision-making processes. Our method unlocks several zero-shot text interpretability applications for any visual classifier. We apply our method to 40 visual classifiers and demonstrate two primary applications: 1) building both label-free and zero-shot concept bottleneck models, thereby converting any visual classifier into an inherently interpretable one, and 2) zero-shot decoding of visual features into natural language sentences. In both tasks we establish new state-of-the-art results, outperforming existing works and surpassing CLIP-based baselines with ImageNet-only trained classifiers, while using up to 400x fewer images and 400,000x less text during training.
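The abstract does not spell out the training objective. As a rough illustration of what preserving the classifier's output distribution could look like, the sketch below compares the frozen classifier's class probabilities with probabilities obtained by scoring the same image feature against projected text embeddings. The linear `mapper`, the omission of bias terms, the function names, and the use of a KL divergence are all assumptions made for illustration, not the authors' stated method.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def original_probs(feature, class_weights):
    # Frozen classifier head: one weight vector per class (bias omitted here).
    return softmax([dot(feature, w) for w in class_weights])

def text_probs(feature, text_embeddings, mapper):
    # Hypothetical linear mapper with shape (feat_dim, text_dim), stored as rows.
    # It projects each class-name text embedding into the visual feature space.
    projected = [[dot(row, t) for row in mapper] for t in text_embeddings]
    return softmax([dot(feature, p) for p in projected])

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): small when the text-queried distribution q matches p.
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

# Toy setup: 2-D features, 3 classes, one-hot "text embeddings".
feature = [1.0, -0.5]
class_weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A mapper that reproduces the classifier weights, and an untrained one.
matched_mapper = [[2.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
untrained_mapper = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

p = original_probs(feature, class_weights)
loss_matched = kl_divergence(p, text_probs(feature, text_embeddings, matched_mapper))
loss_untrained = kl_divergence(p, text_probs(feature, text_embeddings, untrained_mapper))
```

Under these assumptions, a mapper that reproduces the classifier's decisions drives the divergence toward zero, while an untrained mapper does not. No labels are needed, since the target is the classifier's own prediction, which is consistent with the label-free claim in the abstract.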