IR AI DL
Jun 23, 2023

Multimodal Search on Iconclass using Vision-Language Pre-Trained Models

arXiv:2306.16529v1
1 citation
h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better multimodal search capabilities in cultural heritage digitization, though it is incremental as it applies an existing model to a specific domain.

The paper addresses the inadequate semantic representation of cultural heritage terminology in Information Retrieval systems. It implements a new search engine for the Iconclass classification system using the CLIP vision-language model, enabling retrieval of Iconclass concepts with visual or textual queries.

Terminology sources, such as controlled vocabularies, thesauri and classification systems, play a key role in digitizing cultural heritage. However, Information Retrieval (IR) systems that allow users to query and explore these lexical resources often lack an adequate representation of the semantics behind the user's search, which can be conveyed through multiple expression modalities (e.g., images, keywords or textual descriptions). This paper presents the implementation of a new search engine for one of the most widely used iconography classification systems, Iconclass. The novelty of this system is the use of a pre-trained vision-language model, namely CLIP, to retrieve and explore Iconclass concepts using visual or textual queries.
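
The abstract does not describe the search pipeline in detail, so the following is only a minimal sketch of how retrieval in a shared vision-language embedding space is commonly done, not the authors' implementation. It assumes that each concept and each query (image or text) has already been mapped to a vector by a model such as CLIP, and ranks concepts by cosine similarity. The function names `cosine` and `search`, the concept identifiers, and the toy three-dimensional vectors are all hypothetical placeholders; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, concept_index, top_k=3):
    """Return the top_k (concept_id, score) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in concept_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cid, score) for score, cid in scored[:top_k]]


# Hypothetical stand-ins for precomputed concept embeddings.
concept_index = {
    "concept_a": [0.9, 0.1, 0.0],
    "concept_b": [0.1, 0.9, 0.0],
    "concept_c": [0.0, 0.2, 0.9],
}

# Hypothetical stand-in for the embedding of an image or text query.
query_vec = [0.8, 0.2, 0.1]

results = search(query_vec, concept_index, top_k=2)
for concept_id, score in results:
    print(f"{concept_id}: {score:.3f}")
```

Because image and text queries are assumed to land in the same vector space, the same `search` function would serve both modalities; only the encoder that produces `query_vec` would differ.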

