Image Similarity using An Ensemble of Context-Sensitive Models
This work addresses how to improve and compare image similarity models, a problem of interest to computer vision researchers. The contribution is incremental: it extends existing context-sensitive methods with an ensemble approach.
The paper tackles the evaluation of image similarity models by introducing a more intuitive labelling scheme based on relative comparisons (A:R vs B:R), and it uses an ensemble method to address sparse sampling and model biases. The ensemble performs ~5% better than the best individual context-sensitive models and outperforms existing deep embeddings such as CLIP and DINO.
Image similarity has been extensively studied in computer vision. In recent years, machine-learned models have shown their ability to encode more semantics than traditional multivariate metrics. However, when labelling semantic similarity, assigning a numerical score to a pair of images is impractical, which makes it difficult to improve and compare models on this task. In this work, we present a more intuitive approach to building and comparing image similarity models based on labelled data in the form of A:R vs B:R, i.e., determining whether an image A is closer to a reference image R than another image B is. We address the challenges of sparse sampling in the image space (R, A, B) and of biases in models trained with context-based data by using an ensemble model. Our testing results show that the ensemble model performs ~5% better than the best individual context-sensitive models. It also performs better than models that were directly fine-tuned on mixed imagery data and than existing deep embeddings, e.g., CLIP and DINO. This work demonstrates that context-based labelling and model training can be effective when an appropriate ensemble approach is used to alleviate the limitations of sparse sampling.
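To make the A:R vs B:R formulation concrete, the following is a minimal sketch of how triplet labels could be used to score a model and how several context-sensitive models could be combined. It rests on assumptions that the abstract does not state: each model is reduced to a function answering "is A closer to R than B is?", closeness is measured by cosine similarity between embeddings, and the ensemble is a plain majority vote. All names (`embedding_model`, `majority_vote`, `triplet_accuracy`) and the toy vectors standing in for images are hypothetical and are not the authors' method.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (R, A, B)
# A model answers one question: is A closer to R than B is?
TripletModel = Callable[[Vector, Vector, Vector], bool]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def embedding_model(embed: Callable[[Vector], Vector]) -> TripletModel:
    """Turn an embedding function into an A:R vs B:R decision rule (assumed)."""
    def predict(r: Vector, a: Vector, b: Vector) -> bool:
        er, ea, eb = embed(r), embed(a), embed(b)
        return cosine_similarity(er, ea) > cosine_similarity(er, eb)
    return predict


def majority_vote(models: List[TripletModel]) -> TripletModel:
    """Assumed ensemble rule: True only if a strict majority votes True."""
    def predict(r: Vector, a: Vector, b: Vector) -> bool:
        votes = sum(1 for model in models if model(r, a, b))
        return 2 * votes > len(models)
    return predict


def triplet_accuracy(model: TripletModel,
                     triplets: List[Triplet],
                     labels: List[bool]) -> float:
    """Fraction of triplets where the model agrees with the label."""
    correct = sum(model(r, a, b) == y for (r, a, b), y in zip(triplets, labels))
    return correct / len(labels)


# Toy data: 3-d feature vectors stand in for images.
R = (1.0, 0.0, 1.0)
A = (1.0, 0.1, 0.9)
B = (0.0, 1.0, 0.2)

# Three hypothetical "context-sensitive" models, each seeing different features.
ensemble = majority_vote([
    embedding_model(lambda x: x[:2]),
    embedding_model(lambda x: x[1:]),
    embedding_model(lambda x: x),
])

triplets = [(R, A, B), (R, B, A)]
labels = [True, False]  # True means "A is closer to R than B is"
accuracy = triplet_accuracy(ensemble, triplets, labels)
```

Because each label is a binary judgement, accuracy on held-out triplets gives a single number on which individual models, the ensemble, and off-the-shelf embeddings can be compared. With an even number of models, this voting rule resolves ties to False; an odd number of models avoids ties.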