An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?
This work addresses the challenge of sustainable subject indexing for digital libraries. The contribution is incremental, since it builds on existing authority files and classification methods.
The authors tackle subject indexing at scale and across languages by releasing a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), together with a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping of text to authority terms, and agent-assisted cataloging with reproducible evaluation.
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.