CV | Feb 17, 2025

ILIAS: Instance-Level Image retrieval At Scale

Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, Giorgos Tolias

arXiv:2502.11748v324.321 citations | h-index: 28 | CVPR

Originality: Synthesis-oriented

AI Analysis

This work provides a new benchmark dataset for the computer vision and retrieval communities, addressing the need for large-scale, diverse, and accurate evaluation of instance-level image retrieval. It is incremental, however, in that it builds on existing dataset creation efforts.

The authors introduce ILIAS, a large-scale test dataset for instance-level image retrieval, designed to evaluate how well foundation models and retrieval techniques recognize specific objects. Benchmarking shows that domain-specific models excel in their own domains but fail on ILIAS, and that vision-language models benefit from linear adaptation layers.

This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e., the compilation date of YFCC100M. Extensive benchmarking is performed, with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS; ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-language models; iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter; iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case. Website: https://vrg.fel.cvut.cz/ilias/
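The retrieval setup described in the abstract (a query ranked against positives mixed with distractors) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 2-D vectors stand in for image embeddings, cosine similarity stands in for whatever similarity a given model uses, and average precision is used as a generic ranking metric; the abstract does not state which metric ILIAS reports. The function names `rank_database` and `average_precision` are hypothetical and do not come from the ILIAS code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def rank_database(query, database):
    """Return database indices sorted by descending cosine similarity."""
    q = l2_normalize(query)
    scored = []
    for idx, item in enumerate(database):
        d = l2_normalize(item)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    # Sort by similarity (highest first); break ties by index for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored]


def average_precision(ranking, positives):
    """Average of precision values at the rank of each positive item."""
    positives = set(positives)
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives)


# Toy database: two positives for the query object, three distractors.
database = [
    [1.0, 0.1],    # 0: positive
    [0.0, 1.0],    # 1: distractor
    [0.9, 0.3],    # 2: positive
    [-1.0, 0.0],   # 3: distractor
    [0.99, 0.2],   # 4: distractor that looks similar to the query
]
positives = {0, 2}
query = [1.0, 0.0]

ranking = rank_database(query, database)
ap = average_precision(ranking, positives)
print(ranking, round(ap, 4))
```

In this toy case the look-alike distractor (index 4) is ranked between the two positives, which lowers the average precision below 1. At the scale described in the abstract, exhaustive Python loops like these would be replaced by batched matrix operations or an approximate nearest-neighbor index.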

