Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
This work addresses the need for evaluation metrics that go beyond accuracy in visual concept discovery and image classification, offering an incremental improvement for researchers who use vision-language models.
The paper tackles the problem of evaluating text-based visual descriptors for vision-language models. It analyzes descriptor quality along two dimensions, representational capacity and relationship with pre-training data, and introduces alignment-based metrics that provide insights beyond accuracy.
Text-based visual descriptors--ranging from simple class names to more descriptive phrases--are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics--Global Alignment and CLIP Similarity--that move beyond accuracy. These metrics shed light on how different descriptor generation strategies interact with foundation model properties, offering new ways to study descriptor effectiveness.
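For readers unfamiliar with descriptor-based classification, the following is a minimal sketch of the general idea, not the paper's Global Alignment or CLIP Similarity metric, which the abstract does not define. It assumes each class has a set of descriptor embeddings, scores an image by its mean cosine similarity to each class's descriptors, and picks the best class. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real VLM text and image embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(image_emb, descriptor_embs_by_class):
    """Score each class by the mean similarity of the image to its descriptors."""
    scores = {
        label: sum(cosine(image_emb, d) for d in embs) / len(embs)
        for label, embs in descriptor_embs_by_class.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy embeddings; a real pipeline would obtain these from a VLM.
descriptors = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.2]],
}
image = [0.8, 0.3, 0.0]

label, scores = classify(image, descriptors)
print(label, scores)
```

Accuracy under this scheme only records whether the top-scoring class is correct. It does not by itself show how well the descriptors work as a representation or how they relate to pre-training data, which is the gap the two alignment-based metrics are meant to address.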