CL · Jan 13, 2016

Predicting the Effectiveness of Self-Training: Application to Sentiment Classification

arXiv:1601.03288v1 · 19 citations

Originality: Incremental advance

AI Analysis

This work addresses the unpredictability of self-training, giving NLP practitioners a practical tool for semi-supervised learning in sentiment analysis. The advance is incremental, since it builds on existing self-training methods.

The paper tackles the problem of predicting when self-training improves performance in sentiment classification. It shows that corpus similarity can identify beneficial setups, with accuracy gains of up to 15% in high-similarity cases.

The goal of this paper is to investigate the connection between the performance gain that can be obtained by self-training and the similarity between the corpora used in this approach. Self-training is a semi-supervised technique designed to increase the performance of machine learning algorithms by automatically classifying instances of a task and adding these as additional training material to the same classifier. In the context of language processing tasks, this training material is mostly an (annotated) corpus. Unfortunately, self-training does not always lead to a performance increase, and whether it will is largely unpredictable. We show that the similarity between corpora can be used to identify those setups for which self-training can be beneficial. We consider this research a step in the process of developing a classifier that is able to adapt itself to each new test corpus that it is presented with.
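The self-training loop and the corpus-similarity check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the word-count classifier, the cosine similarity over unigram counts, the confidence threshold, and all function names (`corpus_similarity`, `self_train`, and so on) are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
import math


def unigram_counts(docs):
    """Word-frequency profile of a corpus (whitespace-tokenised strings)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def corpus_similarity(docs_a, docs_b):
    """Cosine similarity between unigram profiles.

    Assumed stand-in measure; the paper may use a different one.
    """
    a, b = unigram_counts(docs_a), unigram_counts(docs_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labeled):
    """Toy classifier (hypothetical): one word-count table per label."""
    model = {}
    for text, label in labeled:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Return (label, confidence); confidence is the winner's share of overlap."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known words: abstain
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly label the pool and absorb confident predictions."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, conf = predict(model, text)
            (confident if conf >= threshold else rest).append((text, label))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = [text for text, _ in rest]
    return train(labeled), labeled
```

In a setup like the one the abstract describes, one might compute `corpus_similarity` between the labeled corpus and the unlabeled corpus first, and run `self_train` only when the score is high enough; where to place that cut-off is an empirical question this sketch does not answer.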

