CL LG

Feb 8, 2017

Data Selection Strategies for Multi-Domain Sentiment Analysis

Sebastian Ruder, Parsa Ghaffari, John G. Breslin

arXiv:1702.02426v1 4.831 citations

Originality: Incremental advance

AI Analysis

This work addresses data selection in multi-domain sentiment analysis, which matters for model performance across varied domains such as tweets and reviews. It appears incremental because it builds on existing domain adaptation approaches.

The authors study domain similarity metrics for selecting training data in multi-domain sentiment analysis and propose novel representations, metrics, and a new scope for data selection. On tweets and reviews, the proposed methods consistently outperform random and balanced baselines, and the proposed selection strategy yields the best score on a large reviews corpus.

Domain adaptation is important in sentiment analysis as sentiment-indicating words vary between domains. Recently, multi-domain adaptation has become more pervasive, but existing approaches train on all available source domains including dissimilar ones. However, the selection of appropriate training data is as important as the choice of algorithm. We undertake -- to our knowledge for the first time -- an extensive study of domain similarity metrics in the context of sentiment analysis and propose novel representations, metrics, and a new scope for data selection. We evaluate the proposed methods on two large-scale multi-domain adaptation settings on tweets and reviews and demonstrate that they consistently outperform strong random and balanced baselines, while our proposed selection strategy outperforms instance-level selection and yields the best score on a large reviews corpus.
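The abstract does not say which similarity metrics or representations are studied, so the following is only a minimal sketch of the general idea of similarity-based data selection. It assumes unigram term distributions as the domain representation and Jensen-Shannon divergence as the metric; the function names (`term_distribution`, `js_divergence`, `rank_source_domains`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_distribution(docs, vocab):
    """Smoothed unigram distribution of a domain over a shared vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts[t] for t in vocab)
    # Add-one smoothing keeps every probability strictly positive.
    return [(counts[t] + 1) / (total + len(vocab)) for t in vocab]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_source_domains(sources, target_docs):
    """Return source domain names ordered from most to least similar to the target."""
    all_docs = [doc for docs in sources.values() for doc in docs] + list(target_docs)
    vocab = sorted({tok for doc in all_docs for tok in doc.lower().split()})
    target = term_distribution(target_docs, vocab)
    scores = {
        name: js_divergence(term_distribution(docs, vocab), target)
        for name, docs in sources.items()
    }
    # Lower divergence means the source domain is closer to the target.
    return sorted(scores, key=scores.get)


# Toy data (assumption, for illustration only).
sources = {
    "electronics": ["battery life is great", "screen broke after a week"],
    "books": ["the plot was gripping", "boring characters and a slow plot"],
}
target_docs = ["a gripping plot with great characters"]
ranking = rank_source_domains(sources, target_docs)
```

Under these assumptions, a model would then be trained on only the top-ranked source domains rather than on all of them, which is the contrast the abstract draws with approaches that train on every available source domain.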
