IR DL LG ML
Oct 16, 2019

Using Supervised Learning to Classify Metadata of Research Data by Discipline of Research

Tobias Weber, Dieter Kranzlmüller, Michael Fromm, Nelson Tavares de Sousa

arXiv:1910.09313v1

Originality: Synthesis-oriented

AI Analysis

The work enables automated discipline classification for scientometrics, repository services, and data aggregation, though the contribution is incremental: it applies existing methods to a new dataset.

The paper addresses the automatic classification of research data metadata by discipline, using a large dataset of 609,524 records from DataCite. Multi-layer perceptron models achieve the best f1-macro score, 0.760.

Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprising 609,524 records, which is published alongside this paper. These data allow classification approaches, such as tree-based models and neural networks, to be assessed reproducibly. According to our experiments with 20 base classes (multi-label classification), multi-layer perceptron models perform best with an f1-macro score of 0.760, closely followed by Long Short-Term Memory models (f1-macro score of 0.755). A possible application of the trained classification models is the quantitative analysis of trends towards interdisciplinarity of digital scholarly output or the characterization of growth patterns of research data, stratified by discipline of research. Both applications perform at scale with the proposed models, which are available for re-use.
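To make the reported metric concrete, the following is a minimal sketch of how an f1-macro score can be computed for multi-label predictions: F1 is calculated per label and then averaged with equal weight per label. It follows the standard definition, not necessarily the exact evaluation code used in the paper. The function name `f1_macro`, the three-label toy data, and the convention of scoring a label without any positives as 0.0 are assumptions made for illustration.

```python
from typing import Sequence


def f1_macro(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro-averaged F1 over labels for multi-label indicator matrices.

    Each row is one record; each column is one label (0 or 1).
    A label with no true and no predicted positives contributes 0.0
    (an assumed convention, not taken from the paper).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    if not y_true:
        raise ValueError("at least one record is required")
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in pairs if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty label.
        per_label.append(2 * tp / denom if denom else 0.0)
    # Macro averaging: every label counts equally, regardless of frequency.
    return sum(per_label) / n_labels


# Toy data with 3 hypothetical labels (the paper uses 20 base classes).
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
score = f1_macro(y_true, y_pred)
print(f"f1-macro: {score:.3f}")
```

Because every label has equal weight, a discipline with few records influences the score as much as a heavily populated one, which is relevant when class sizes are unbalanced.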
