DB · AI · Jun 8

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

Luciano Duarte, Olga Ovcharenko, Sebastian Schelter
arXiv:2606.09648v18.2
Predicted impact: top 59% in DB · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers in multi-modal data management, this dataset provides a real-world benchmark to advance error detection and semantic query processing.

The paper introduces ArtiFact, a large-scale multi-modal cultural heritage dataset of 651,045 museum records, and demonstrates its utility through two downstream tasks: cross-modal error detection and semantic query processing. Results show that current systems struggle with domain-specific errors and complex queries, positioning ArtiFact as a challenging benchmark.

Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets that combine tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651,045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130,209 records and show that reliably detecting subtle domain-specific errors, such as material anachronisms and temporal shifts, remains an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.
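
To make the error-injection setup more concrete, the following is a minimal sketch of how one error category, a temporal shift, might be injected into a sample of records. It is an illustration under stated assumptions, not the authors' procedure: the function `inject_temporal_shift`, the record fields (`id`, `title`, `year`), the 300-year shift, and uniform random sampling are all hypothetical. The default fraction of 0.2 is chosen only because 130,209 is exactly one fifth of 651,045; the abstract does not say how the corrupted records were selected.

```python
import random
from copy import deepcopy


def inject_temporal_shift(records, fraction=0.2, shift_years=300, seed=0):
    """Return a corrupted copy of `records` and the sorted indices that were changed.

    Hypothetical helper: shifts the `year` field of a random sample of records
    and tags each corrupted record so a detector can be scored afterwards.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible sample
    corrupted = deepcopy(records)      # leave the clean records untouched
    n_errors = round(len(corrupted) * fraction)
    chosen = rng.sample(range(len(corrupted)), n_errors)
    for i in chosen:
        corrupted[i]["year"] += shift_years
        corrupted[i]["injected_error"] = "temporal_shift"
    return corrupted, sorted(chosen)


# Toy records with assumed field names; real ArtiFact records will differ.
records = [{"id": i, "title": f"Object {i}", "year": 1600 + i} for i in range(10)]
corrupted, error_ids = inject_temporal_shift(records)
```

Keeping the clean records alongside the list of corrupted indices gives ground-truth labels, so a detector's output can be compared against `error_ids` to compute precision and recall.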

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
