CV · Mar 14, 2025

Unlocking Text Capabilities in Vision Models

arXiv:2503.10981v2 · h-index: 13
Originality: Highly original
AI Analysis

This work addresses the interpretability challenge in vision models for researchers and practitioners, offering a data-efficient method to enhance model transparency without compromising performance.

The authors tackled the problem of making visual classifiers interpretable by enabling them to be queried with free-form text, achieving new state-of-the-art results in zero-shot concept bottleneck models and feature decoding while using up to 400x fewer images and 400,000x less text during training.

Visual classifiers provide high-dimensional feature representations that are challenging to interpret and analyze. Text, in contrast, provides a more expressive and human-friendly medium for understanding and analyzing model behavior. We propose a simple yet powerful method for reformulating any pretrained visual classifier so that it can be queried with free-form text without compromising its original performance. Our approach is label-free, data- and compute-efficient, and is trained to preserve the underlying classifier's distribution and decision-making processes. Our method unlocks several zero-shot text interpretability applications for any visual classifier. We apply our method to 40 visual classifiers and demonstrate two primary applications: 1) building both label-free and zero-shot concept bottleneck models, thereby converting any visual classifier into an inherently interpretable one, and 2) zero-shot decoding of visual features into natural language sentences. In both tasks we establish new state-of-the-art results, outperforming existing works and surpassing CLIP-based baselines with ImageNet-only trained classifiers, while using up to 400x fewer images and 400,000x less text during training.
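To make the concept bottleneck idea more concrete, here is a minimal sketch of how a zero-shot concept bottleneck could be assembled once text can be embedded in the same space as a classifier's visual features. It is not the authors' implementation: the function `zero_shot_cbm`, the helper `l2_normalize`, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration, and the abstract does not specify how the text-to-feature mapping is trained.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_cbm(image_features, concept_embeddings, class_embeddings):
    """Hypothetical zero-shot concept bottleneck (illustrative only).

    image_features:     (n, d) visual features from a pretrained classifier
    concept_embeddings: (k, d) text concepts, ASSUMED to already live in the
                        classifier's feature space
    class_embeddings:   (m, d) class-name texts in that same space
    Returns concept scores of shape (n, k) and class logits of shape (n, m).
    """
    f = l2_normalize(image_features)
    c = l2_normalize(concept_embeddings)
    w = l2_normalize(class_embeddings)

    # Bottleneck layer: cosine similarity of each image to each text concept.
    concept_scores = f @ c.T

    # Concept-to-class weights come from text-text similarity,
    # so no labels are needed to build the final layer.
    concept_to_class = c @ w.T

    # Class logits are a linear function of the concept scores only.
    logits = concept_scores @ concept_to_class
    return concept_scores, logits


# Toy usage with hand-made embeddings (placeholders, not real model outputs).
concepts = np.eye(3)            # three orthogonal "concepts"
classes = np.eye(3)[:2]         # two classes aligned with concepts 0 and 1
images = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0]])

scores, logits = zero_shot_cbm(images, concepts, classes)
predictions = logits.argmax(axis=1)
print(scores.shape, logits.shape, predictions)
```

Because every prediction passes through the concept scores, each class logit can be broken down into per-concept contributions, which is the property that makes a bottleneck of this kind inspectable.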

