EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in GLAM Collections
This work addresses metadata annotation challenges for cultural heritage institutions, but its contribution is incremental: it centers on a new dataset with baseline models.
The paper tackles automatic metadata annotation for GLAM collections by introducing the EUFCC-340K dataset, which contains over 340,000 images organized hierarchically across facets such as Materials and Object Types. Experiments with baseline models suggest that the dataset is useful for improving multi-label classification tools for cataloging tasks.
In this paper, we address the challenges of automatic metadata annotation in the domain of Galleries, Libraries, Archives, and Museums (GLAMs) by introducing a novel dataset, EUFCC340K, collected from the Europeana portal. Comprising over 340,000 images, the EUFCC340K dataset is organized across multiple facets (Materials, Object Types, Disciplines, and Subjects) and follows a hierarchical structure based on the Art & Architecture Thesaurus (AAT). We developed several baseline models: one family attaches multiple heads to a ConvNeXt backbone for multi-label image tagging on these facets, and another fine-tunes a CLIP model on our image-text pairs. Experiments that evaluate model robustness and generalization in two different test scenarios demonstrate the utility of the dataset in improving multi-label classification tools that have the potential to alleviate cataloging tasks in the cultural heritage sector.
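To make the faceted, hierarchical labeling concrete, the following minimal sketch shows one way per-facet multi-hot targets could be built for a multi-head tagger. It is an illustration under stated assumptions, not the authors' pipeline: the toy hierarchy, the label names, the helper names (`with_ancestors`, `multi_hot`, `encode_record`), and the choice to activate every ancestor of an annotated label are all hypothetical and do not come from the dataset or the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy per facet: child -> parent (None marks a root).
# These labels are illustrative only and are not taken from EUFCC340K or the AAT.
HIERARCHY: Dict[str, Dict[str, Optional[str]]] = {
    "materials": {"metal": None, "bronze": "metal", "wood": None, "oak": "wood"},
    "object_types": {"vessel": None, "bowl": "vessel", "sculpture": None},
}


def with_ancestors(facet: str, label: str) -> List[str]:
    """Return the label followed by its ancestors, leaf first."""
    parents = HIERARCHY[facet]
    chain: List[str] = []
    current: Optional[str] = label
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


def multi_hot(facet: str, labels: List[str]) -> List[int]:
    """Encode labels plus their ancestors over the sorted facet vocabulary."""
    vocab = sorted(HIERARCHY[facet])
    # Assumption: annotating a node also activates all of its ancestors.
    active = {node for label in labels for node in with_ancestors(facet, label)}
    return [1 if name in active else 0 for name in vocab]


def encode_record(record: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Build one target vector per facet; a missing facet yields all zeros."""
    return {facet: multi_hot(facet, record.get(facet, [])) for facet in HIERARCHY}


if __name__ == "__main__":
    print(encode_record({"materials": ["bronze"], "object_types": ["bowl"]}))
```

In a multi-head setup, each facet's vector would serve as the target for that facet's classification head. A record with no annotation for a facet yields an all-zero vector here; whether such facets should instead be masked out of the loss is a design choice this sketch leaves open.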