PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck
PEEB addresses the challenge of fine-grained image classification in domains with rare or new class names, such as scientific bird names, by providing classifiers that are explainable and can be edited without retraining.
The paper tackles the poor performance of CLIP-based classifiers on new or rare classes by proposing PEEB, a part-based classifier that uses editable text descriptors for fine-grained classification. PEEB achieves ~10x higher top-1 accuracy than CLIP in a zero-shot setting and state-of-the-art supervised accuracy among part-based classifiers on CUB-200 and Dogs-120.
CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder. Therefore, they perform poorly on new classes or classes whose names rarely appear on the Internet (e.g., scientific names of birds). For fine-grained classification, we propose PEEB, an explainable and editable classifier that (1) expresses each class name as a set of text descriptors that describe the visual parts of that class; and (2) matches the embeddings of the detected parts to their textual descriptors in each class to compute a logit score for classification. In a zero-shot setting where the class names are unknown, PEEB outperforms CLIP by a huge margin (~10x in top-1 accuracy). Compared to part-based classifiers, PEEB is not only the state-of-the-art (SOTA) in the supervised-learning setting (88.80% and 92.20% accuracy on CUB-200 and Dogs-120, respectively) but also the first to enable users to edit the text descriptors to form a new classifier without any re-training. Compared to concept bottleneck models, PEEB is also the SOTA in both zero-shot and supervised-learning settings.
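The matching step in (2) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each class has one descriptor embedding per part, that part and descriptor embeddings share one vector space, and that the class logit is the sum of per-part cosine similarities. The function names, the random embeddings, and the class names (`class_a`, `class_d`, and so on) are all hypothetical stand-ins for the outputs of a real part detector and text encoder.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def class_logits(part_embeddings, class_descriptors):
    """Score every class by summing part-to-descriptor similarities.

    part_embeddings: list of P vectors, one per detected part.
    class_descriptors: dict mapping class name -> list of P descriptor
        vectors, in the same part order (an assumption of this sketch).
    """
    return {
        name: sum(cosine(p, d) for p, d in zip(part_embeddings, descs))
        for name, descs in class_descriptors.items()
    }


random.seed(0)
DIM, PARTS = 8, 4  # toy sizes, chosen only for illustration


def random_vectors():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PARTS)]


# Hypothetical descriptor embeddings for three classes.
class_descriptors = {name: random_vectors() for name in ("class_a", "class_b", "class_c")}

# A toy "image" whose part embeddings lie close to class_b's descriptors.
part_embeddings = [
    [x + 0.01 * random.gauss(0, 1) for x in vec]
    for vec in class_descriptors["class_b"]
]

logits = class_logits(part_embeddings, class_descriptors)
predicted = max(logits, key=logits.get)

# "Editing" the classifier: add a class by supplying new descriptors.
# No parameters are updated; only the descriptor table changes.
edited_descriptors = dict(class_descriptors)
edited_descriptors["class_d"] = [list(vec) for vec in part_embeddings]
edited_logits = class_logits(part_embeddings, edited_descriptors)
edited_prediction = max(edited_logits, key=edited_logits.get)
```

In this sketch the classifier is defined entirely by the descriptor table, which is why adding or replacing a class's descriptors yields a new classifier without any re-training step. Each per-part similarity can also be inspected on its own, which is the sense in which such a score is explainable.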