CL · Sep 16, 2019

Short-Text Classification Using Unsupervised Keyword Expansion

arXiv:1909.07512v10.2

Originality: Incremental advance

AI Analysis

This paper addresses low performance in short-text classification, a problem data scientists face when training data is limited. The advance is incremental because it builds on existing pre-trained language models.

The paper tackled short-text classification with limited data by generating relevant keywords using a pre-trained language model to create extended pseudo documents, improving classification accuracy when training examples were scarce.

Short-text classification, like all data science, struggles to achieve high performance using limited data. One solution is to expand each short sentence with new, relevant feature words, which forms an artificially enlarged training dataset and adds new features to the testing data. This paper applies a novel approach to text expansion by generating new words directly for each input sentence, thus requiring no additional datasets or previous training. In this unsupervised approach, new keywords are formed within the hidden states of a pre-trained language model and then used to create extended pseudo documents. The word generation process was assessed by examining how well the predicted words matched the topics of the input sentence. It was found that this method could produce 3-10 relevant new words for each target topic, while generating just 1 word related to each non-target topic. Generated words were then added to short news headlines to create extended pseudo headlines. Experimental results show that models trained on the pseudo headlines can achieve higher classification accuracy when the number of training examples is limited.
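
The expansion step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the language model is replaced by a hypothetical lookup table (`TOY_ASSOCIATIONS`, `toy_predict_words`), and the helper names `expand_headline` and `topic_hits` are invented here. In the paper, candidate words come from the hidden states of a pre-trained language model, which this sketch does not reproduce.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# ASSUMPTION: a toy stand-in for the pre-trained language model. A real
# system would derive candidate words from the model's hidden states.
TOY_ASSOCIATIONS: Dict[str, List[str]] = {
    "stocks": ["market", "shares", "investors"],
    "striker": ["goal", "football", "league"],
}


def toy_predict_words(sentence: str) -> List[str]:
    """Return candidate keywords for a sentence (hypothetical predictor)."""
    candidates: List[str] = []
    for token in sentence.lower().split():
        candidates.extend(TOY_ASSOCIATIONS.get(token, []))
    return candidates


def expand_headline(
    headline: str,
    predict_words: Callable[[str], List[str]],
    max_new_words: int = 10,
) -> str:
    """Append up to max_new_words generated keywords to form a pseudo headline."""
    existing = set(headline.lower().split())
    new_words: List[str] = []
    for word in predict_words(headline):
        if len(new_words) >= max_new_words:
            break
        # Skip words already in the headline and duplicates among candidates.
        if word not in existing and word not in new_words:
            new_words.append(word)
    return " ".join([headline] + new_words)


def topic_hits(words: Iterable[str], topic_vocab: Dict[str, Set[str]]) -> Counter:
    """Count how many generated words fall into each topic's vocabulary."""
    hits: Counter = Counter()
    for word in words:
        for topic, vocab in topic_vocab.items():
            if word in vocab:
                hits[topic] += 1
    return hits


pseudo = expand_headline("Stocks rally as striker scores", toy_predict_words, max_new_words=4)
hits = topic_hits(
    ["market", "goal", "shares"],
    {"business": {"market", "shares"}, "sport": {"goal"}},
)
```

The pseudo headline keeps the original text and appends the generated words, so a downstream classifier sees the extra features for both training and testing examples. `topic_hits` mirrors, in a simplified way, the idea of checking how many generated words relate to a target topic versus non-target topics.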
