Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints
This work addresses the challenge of maintaining high data quality in digital libraries for short texts; it is an incremental improvement over existing multi-label classification methods.
The paper tackles the problem of meeting quality constraints in automatic subject indexing of short texts by proposing a novel approach that detects the documents for which quality criteria are met, rather than only scoring the relevance of individual concepts. The results show that, on text collections from law and economics, the technique can achieve considerable gains in document-level recall while maintaining precision.
Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document level. Therefore, we propose a novel approach that detects the documents, rather than the concepts, for which quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of previously unseen data where considerable gains in document-level recall can be achieved while precision is upheld. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
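The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the deep regression architecture is not reproduced here, and the per-document quality estimates, the threshold of 0.7, the toy subject labels, and all function names are hypothetical values chosen for illustration. The sketch only shows how document-level precision and recall are computed and how keeping documents with a high estimated quality can raise recall on the retained subset.

```python
from typing import List, Set, Tuple

Pair = Tuple[Set[str], Set[str]]  # (predicted subjects, gold-standard subjects)


def doc_precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the subject set assigned to one document."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def mean_precision_recall(pairs: List[Pair]) -> Tuple[float, float]:
    """Average the document-level precision and recall over a collection."""
    if not pairs:
        return 0.0, 0.0
    scores = [doc_precision_recall(p, g) for p, g in pairs]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


def filter_by_estimated_quality(estimates: List[float], threshold: float) -> List[int]:
    """Return indices of documents whose estimated quality meets the threshold."""
    return [i for i, q in enumerate(estimates) if q >= threshold]


# Toy collection (hypothetical): predicted vs. gold subjects per document.
docs: List[Pair] = [
    ({"tax law", "contracts"}, {"tax law", "contracts"}),
    ({"inflation"}, {"inflation", "monetary policy", "central banks"}),
    ({"labour market", "wages"}, {"labour market", "wages"}),
    ({"trade"}, {"trade", "tariffs"}),
]

# Hypothetical outputs of a quality estimator (stand-in for the regression model).
estimated_recall = [0.9, 0.3, 0.8, 0.4]

kept = filter_by_estimated_quality(estimated_recall, threshold=0.7)
all_p, all_r = mean_precision_recall(docs)
kept_p, kept_r = mean_precision_recall([docs[i] for i in kept])

print(f"all docs:  precision={all_p:.2f} recall={all_r:.2f}")
print(f"kept docs: precision={kept_p:.2f} recall={kept_r:.2f} (indices {kept})")
```

In this toy data every predicted subject is correct, so precision is unchanged by the filter, while the retained subset has higher document-level recall than the full collection. In practice the estimates would come from a trained model and would be imperfect, so the threshold trades coverage against the quality of the retained subset.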