LG CL IR · Feb 3, 2024

Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

Francisco J. Ribadas-Pena, Shuyuan Cao, Víctor M. Darriba Bilbao

arXiv:2402.01963v12.62 citations · h-index: 8 · Mathematics

Originality: Incremental advance

AI Analysis

This paper addresses multi-label text categorization for biomedical document indexing, but it appears incremental, as it builds on traditional k-NN combined with autoencoders.

The paper tackles automatic semantic indexing in large document collections with complex label vocabularies. It extends the k-Nearest Neighbors algorithm with a label autoencoder that maps labels to a reduced latent space and regenerates them from it. The method is evaluated on a large portion of the MEDLINE biomedical collection, which is indexed with the MeSH thesaurus.

In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm that uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
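
The general idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes a large trained autoencoder, whereas this sketch swaps in a linear autoencoder obtained from a truncated SVD so that it stays self-contained. The function names (`fit_linear_label_autoencoder`, `knn_latent_predict`), the cosine-similarity weighting, the 0.5 threshold, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_label_autoencoder(Y, latent_dim):
    """Fit a linear encoder/decoder on a binary label matrix Y (n_docs x n_labels).

    Assumption: a truncated SVD stands in for the trained neural
    autoencoder mentioned in the abstract.
    """
    mean = Y.mean(axis=0)
    # Rows of Vt are the principal directions of the centered label space.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    W = Vt[:latent_dim].T  # shape: n_labels x latent_dim

    def encode(labels):
        return (labels - mean) @ W

    def decode(codes):
        return codes @ W.T + mean

    return encode, decode

def knn_latent_predict(X_train, Z_train, x_query, k, decode, threshold=0.5):
    """Average the latent label codes of the k nearest documents, then decode."""
    # Cosine similarity between the query and every training document.
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    q = x_query / np.linalg.norm(x_query)
    sims = Xn @ q
    nearest = np.argsort(-sims)[:k]
    # Similarity-weighted mean of the neighbors' latent codes (hypothetical choice).
    weights = np.clip(sims[nearest], 1e-12, None)
    z = np.average(Z_train[nearest], axis=0, weights=weights)
    scores = decode(z)
    return np.flatnonzero(scores >= threshold), scores

# Toy data: two groups of documents with strongly correlated labels.
X_train = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.1, 0.0, 0.8, 1.0],
    [0.0, 0.1, 1.0, 1.0],
])
Y_train = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

encode, decode = fit_linear_label_autoencoder(Y_train, latent_dim=2)
Z_train = encode(Y_train)  # latent label codes of the training documents

x_query = np.array([1.0, 0.1, 0.05, 0.0])
predicted, scores = knn_latent_predict(X_train, Z_train, x_query, k=3, decode=decode)
print(predicted.tolist())
```

In this toy setting the query lies close to the first group of documents, so the decoded scores favor that group's labels. With a real collection, the document vectors, the autoencoder architecture, and the decision threshold would all need to be chosen and tuned separately.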
