Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric
This work addresses the need for more interpretable similarity metrics in computer vision, though the contribution appears incremental because it builds on existing Vision Transformer and k-NN methods.
The paper aims to improve the explainability of image similarity assessments. It proposes the Case-Enhanced Vision Transformer (CEViT), a similarity metric that can be integrated into k-NN classification. Initial results suggest that this yields accuracy comparable to state-of-the-art models while also making it possible to illustrate differences between classes.
This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases to illustrate aspects of similarity relevant to those cases.
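
To make the k-NN integration concrete, the following is a minimal sketch of k-NN classification with a pluggable similarity function. It is not the paper's implementation: the names `knn_classify`, `cosine_similarity`, and `case_base` are hypothetical, and cosine similarity over plain feature vectors is only a stand-in for the learned CEViT metric, whose architecture is not described in this abstract.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Case = Tuple[Vector, str]  # (feature vector, class label); hypothetical case format


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Stand-in similarity metric (assumption); NOT the CEViT metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def knn_classify(
    query: Vector,
    case_base: List[Case],
    similarity: Callable[[Vector, Vector], float],
    k: int = 3,
) -> Tuple[str, List[Case]]:
    """Return the majority label among the k most similar cases,
    together with those cases so they can serve as an explanation."""
    if not case_base:
        raise ValueError("case_base must not be empty")
    # Rank stored cases from most to least similar to the query.
    ranked = sorted(
        case_base,
        key=lambda case: similarity(query, case[0]),
        reverse=True,
    )
    neighbors = ranked[:k]
    # Majority vote over the neighbor labels.
    votes = Counter(label for _, label in neighbors)
    label = votes.most_common(1)[0][0]
    return label, neighbors


# Toy case base with two classes (illustrative data only).
case_base = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
]
label, neighbors = knn_classify([1.0, 0.05], case_base, cosine_similarity, k=3)
print(label, neighbors)
```

Returning the retrieved neighbors alongside the predicted label reflects the case-based idea in the abstract: the prior cases that drove the decision are available to support an explanation. Under this sketch's assumptions, a learned metric such as CEViT would take the place of the `similarity` argument.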