LG · CL · IR · Feb 3, 2024

Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

arXiv:2402.01963v1 · 2 citations · h-index: 8 · Mathematics
Originality: Incremental advance
AI Analysis

This paper addresses multi-label text categorization for biomedical document indexing, but it appears incremental, since it extends traditional k-NN with a label autoencoder.

The paper tackles automatic semantic indexing in large document collections with complex label vocabularies. It extends the k-Nearest Neighbors algorithm with a label autoencoder that maps labels to a reduced latent space and regenerates them from it. The method is evaluated on a large portion of the MEDLINE biomedical collection, which is indexed with the MeSH thesaurus.

In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
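
The idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It makes several assumptions that are not in the source: a linear autoencoder fitted by truncated SVD stands in for the paper's large neural autoencoder, neighbors are found by cosine similarity over dense document vectors, and the neighbors' latent codes are combined by a plain average before decoding. All function names and the toy data are hypothetical.

```python
import numpy as np


def fit_linear_label_autoencoder(Y, latent_dim):
    """Fit a linear encoder/decoder for a binary label matrix Y (n_docs x n_labels).

    Assumption: truncated SVD of the centered label matrix is used as a
    simple stand-in for a trained neural label autoencoder.
    """
    mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    W = Vt[:latent_dim].T  # shape: (n_labels, latent_dim)
    return mean, W


def encode(Y, mean, W):
    """Map label vectors to the reduced latent space."""
    return (Y - mean) @ W


def decode(Z, mean, W):
    """Regenerate label scores from latent codes."""
    return Z @ W.T + mean


def knn_predict_labels(X_train, Z_train, x_query, mean, W, k=3, threshold=0.5):
    """Predict labels for one query document.

    Assumption: the k nearest neighbors (cosine similarity) contribute their
    latent label codes equally; the averaged code is decoded and thresholded.
    """
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    qn = x_query / np.linalg.norm(x_query)
    nearest = np.argsort(-(Xn @ qn))[:k]
    z_query = Z_train[nearest].mean(axis=0)
    scores = decode(z_query, mean, W)
    return scores, (scores >= threshold).astype(int)


# Toy data (hypothetical): two document clusters with correlated label pairs.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
Y_train = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)

mean, W = fit_linear_label_autoencoder(Y_train, latent_dim=1)
Z_train = encode(Y_train, mean, W)

scores, predicted = knn_predict_labels(
    X_train, Z_train, np.array([0.9, 0.1]), mean, W, k=3
)
print(predicted.tolist())
```

In this toy setup the four labels collapse into a single latent dimension because they co-occur in pairs, which is the kind of inter-label correlation the abstract refers to. A real configuration would replace the SVD step with a trained autoencoder and the dense toy vectors with one of the document representations the paper evaluates.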

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
