CV, Jan 15, 2024

Image Similarity using An Ensemble of Context-Sensitive Models

Oxford

arXiv:2401.07951v23.73 citations, h-index: 3, Has Code, KDD

Originality: Incremental advance

AI Analysis

This work helps computer vision researchers build and compare image similarity models. The advance is incremental: it combines existing context-sensitive models in an ensemble.

The paper addresses the difficulty of evaluating image similarity models. It labels data as relative comparisons (A:R vs B:R) rather than numerical scores, and uses an ensemble to mitigate sparse sampling of the image space and the biases of models trained on context-based data. The ensemble performs ~5% better than the best individual context-sensitive models and also outperforms existing deep embeddings such as CLIP and DINO.

Image similarity has been extensively studied in computer vision. In recent years, machine-learned models have shown their ability to encode more semantics than traditional multivariate metrics. However, in labelling semantic similarity, assigning a numerical score to a pair of images is impractical, which makes it difficult to improve and compare models on this task. In this work, we present a more intuitive approach to build and compare image similarity models based on labelled data in the form of A:R vs B:R, i.e., determining if an image A is closer to a reference image R than another image B. We address the challenges of sparse sampling in the image space (R, A, B) and biases in the models trained with context-based data by using an ensemble model. Our testing results show that the ensemble model constructed performs ~5% better than the best individual context-sensitive models. It also performed better than the models that were directly fine-tuned using mixed imagery data as well as existing deep embeddings, e.g., CLIP and DINO. This work demonstrates that context-based labelling and model training can be effective when an appropriate ensemble approach is used to alleviate the limitation due to sparse sampling.
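
The A:R vs B:R labelling scheme can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: each model is represented by a hypothetical embedding function, closeness is measured with cosine distance, and the ensemble is a simple majority vote. The paper's actual ensemble construction may differ.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def predict_triplet(embed, r, a, b):
    """Answer the A:R vs B:R question for one model.

    `embed` is a hypothetical function mapping an image to a vector.
    Returns "A" if A is at least as close to R as B is, else "B".
    """
    er, ea, eb = embed(r), embed(a), embed(b)
    return "A" if cosine_distance(ea, er) <= cosine_distance(eb, er) else "B"

def ensemble_predict(embeds, r, a, b):
    """Majority vote over several models (an assumed ensemble rule).

    Ties are resolved in favour of "A"; use an odd number of models
    to avoid ties.
    """
    votes = [predict_triplet(e, r, a, b) for e in embeds]
    return "A" if votes.count("A") >= votes.count("B") else "B"

def triplet_accuracy(predict, triplets):
    """Fraction of (R, A, B, label) triplets that `predict` gets right."""
    correct = sum(predict(r, a, b) == label for r, a, b, label in triplets)
    return correct / len(triplets)
```

Because the labels are relative rather than numerical, accuracy on held-out triplets gives one number per model, which is what makes individual models, an ensemble, and off-the-shelf embeddings directly comparable.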
