Taxonomy-Aware Evaluation of Vision-Language Models
This work addresses the challenge of evaluating VLMs on fine-grained classification. The contribution is incremental: it builds on existing evaluation methods by incorporating taxonomic hierarchies.
The paper tackles the problem of evaluating vision-language models (VLMs) whose unconstrained text output may not match the specific ground-truth labels. It proposes a framework that uses hierarchical precision and recall measures, defined over a taxonomy, to assess both the correctness and the specificity of predictions. The experiments show that existing text similarity measures fail to capture taxonomic similarity, and the framework is then applied to evaluate modern VLMs on fine-grained visual classification tasks.
When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
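To make the idea of partial credit concrete, the following is a minimal sketch of hierarchical precision and recall. It assumes one common ancestor-set formulation: each label is expanded to itself plus its ancestors (root excluded), and precision and recall are computed on the overlap of the two sets. The abstract does not spell out the exact definition used in the paper, so this formulation, the toy `PARENT` taxonomy, and the helper names `ancestors` and `hierarchical_pr` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical toy taxonomy, stored as child -> parent (the root has parent None).
PARENT = {
    "plant": None,
    "conifer": "plant",
    "spruce": "conifer",
    "norway spruce": "spruce",
    "pine": "conifer",
    "flowering plant": "plant",
    "oak": "flowering plant",
}


def ancestors(node, parent=PARENT):
    """Return the node together with all of its ancestors, excluding the root."""
    path = set()
    while node is not None and parent[node] is not None:
        path.add(node)
        node = parent[node]
    return path


def hierarchical_pr(pred, truth, parent=PARENT):
    """Hierarchical precision and recall from ancestor-set overlap (assumed formulation)."""
    pred_set = ancestors(pred, parent)
    true_set = ancestors(truth, parent)
    overlap = len(pred_set & true_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


if __name__ == "__main__":
    for guess in ("conifer", "pine", "oak", "norway spruce"):
        print(guess, hierarchical_pr(guess, "norway spruce"))
```

Under these assumptions, predicting 'conifer' for a 'norway spruce' image gets full precision (nothing in the prediction is wrong) but only one third recall (it is less specific than the label). Predicting 'pine' loses precision as well, and 'oak' scores zero on both.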