IR LG Nov 28, 2024

Introducing Three New Benchmark Datasets for Hierarchical Text Classification

Jaco du Toit, Herman Redelinghuys, Marcel Dunaiski

arXiv:2411.19119v1 1 citations h-index: 4

Originality: Synthesis-oriented

AI Analysis

This paper provides new benchmark datasets for hierarchical text classification of scientific publications, an incremental contribution to the field.

The authors introduce three new benchmark datasets for hierarchical text classification in the research publications domain. Two are baselines built on existing journal- and citation-based classification schemas, and the third combines the two schemas to improve reliability. A clustering-based analysis shows that the combined approach yields a higher-quality dataset in which documents of the same class are semantically more similar.

Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: the Web of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2) and New York Times (NYT) datasets. However, apart from the RCV1-V2 dataset, which is well documented, these datasets are not accompanied by detailed descriptions of their methodologies. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications which comprise the titles and abstracts of papers from the Web of Science publication database. We first create two baseline datasets which use existing journal- and citation-based classification schemas. Due to the respective shortcomings of these two existing schemas, we propose an approach which combines their classifications to improve the reliability and robustness of the dataset. We evaluate the three created datasets with a clustering-based analysis and show that our proposed approach results in a higher-quality dataset in which documents that belong to the same class are semantically more similar than in the other datasets. Finally, we report the classification performance of four state-of-the-art HTC approaches on these three new datasets to provide baselines for future studies on machine learning-based techniques for scientific publication classification.
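The claim that documents of the same class are "semantically more similar" can be made concrete with a small sketch. The code below is only an illustration, not the authors' clustering-based analysis: it assumes each document already has an embedding vector, and it scores a labelling by the mean pairwise cosine similarity inside each class. The function names, the toy vectors and the class labels are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_class_similarity(embeddings, labels):
    """Average, over classes, of the mean pairwise cosine similarity
    between documents that share a label (hypothetical quality score)."""
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)

    per_class = []
    for vecs in by_class.values():
        pairs = list(combinations(vecs, 2))
        if pairs:  # classes with a single document have no pairs
            per_class.append(sum(cosine(u, v) for u, v in pairs) / len(pairs))

    if not per_class:
        raise ValueError("need at least one class with two or more documents")
    return sum(per_class) / len(per_class)


# Toy 2-d "embeddings" (assumed; real ones would come from a text encoder).
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Two hypothetical labellings of the same four documents.
consistent_labels = ["Science/Physics", "Science/Physics",
                     "Science/Biology", "Science/Biology"]
mixed_labels = ["Science/Physics", "Science/Biology",
                "Science/Physics", "Science/Biology"]

consistent_score = mean_intra_class_similarity(embeddings, consistent_labels)
mixed_score = mean_intra_class_similarity(embeddings, mixed_labels)
print(consistent_score, mixed_score)
```

Under this toy score, a labelling that groups similar documents together scores higher than one that mixes them, which is the intuition behind comparing the three datasets by how semantically coherent their classes are.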
