CL · AI · DL · IR · Mar 11

An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

arXiv:2603.10876v15.51 citations · h-index: 22

Predicted impact: top 79% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

This paper addresses the challenge of sustainable subject indexing for digital libraries. The contribution is incremental, since it builds on existing authority files and classification methods.

The authors tackled the problem of subject indexing at scale across languages by releasing a large bilingual English/German corpus of catalog records annotated with the Integrated Authority File (GND) and a machine-actionable GND taxonomy. This resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible evaluation.

Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
