Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
This work addresses the need for better multimodal search capabilities in cultural heritage digitization, although its contribution is incremental, since it applies an existing model to a specific domain.
The paper tackles the problem of inadequate semantic representation in Information Retrieval systems for cultural heritage terminology. It implements a new search engine for the Iconclass classification system using the CLIP vision-language model, enabling retrieval of Iconclass concepts with visual or textual queries.
Terminology sources, such as controlled vocabularies, thesauri and classification systems, play a key role in digitizing cultural heritage. However, Information Retrieval (IR) systems that allow users to query and explore these lexical resources often lack an adequate representation of the semantics behind the user's search, which can be conveyed through multiple modalities of expression (e.g., images, keywords or textual descriptions). This paper presents the implementation of a new search engine for one of the most widely used iconography classification systems, Iconclass. The novelty of this system is the use of a pre-trained vision-language model, namely CLIP, to retrieve and explore Iconclass concepts using visual or textual queries.
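The abstract does not spell out the retrieval mechanism, so the following is only a minimal sketch of how such a search could work, not the authors' implementation. It relies on the fact that CLIP maps images and text into a shared embedding space, and assumes that concepts are ranked by cosine similarity to the query embedding. The concept labels, the three-dimensional toy vectors, and the `search` helper are hypothetical placeholders; in a real system the vectors would come from CLIP's text or image encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vec, concept_vecs, k=2):
    """Return the k concept labels most similar to the query embedding."""
    scored = [(cosine(query_vec, vec), label) for label, vec in concept_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Hypothetical pre-computed concept embeddings (toy values, not CLIP output).
concept_vecs = {
    "animals": [0.9, 0.1, 0.0],
    "saints": [0.1, 0.9, 0.2],
    "housing": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding; it could stem from an image or from a text,
# since both modalities are assumed to live in the same vector space.
query_vec = [0.8, 0.2, 0.1]

top = search(query_vec, concept_vecs, k=2)
print(top)
```

Because the ranking step only sees vectors, the same `search` call would serve visual and textual queries alike; only the encoder that produces `query_vec` would differ.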