CL · Dec 24, 2017

Building a Sentiment Corpus of Tweets in Brazilian Portuguese

arXiv:1712.08917v11094 citations
Originality: Synthesis-oriented
AI Analysis

This paper provides a new dataset for researchers working on sentiment analysis in Brazilian Portuguese, but the contribution is incremental: it applies existing methods to a new language and domain.

The paper tackles the lack of sentiment analysis datasets for Brazilian Portuguese by introducing TweetSentBR, a manually annotated corpus of 15,000 tweets in the TV show domain. Baseline classifiers reach accuracies of up to 82.06% on the binary task and 64.62% on the three-class task.

The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area, due to its subjective and semantic characteristics, motivates research on novel methods and approaches for classification. Hence, there is a high demand for datasets in different domains and different languages. This paper introduces TweetSentBR, a sentiment corpus for Brazilian Portuguese with 15,000 manually annotated sentences in the TV show domain. The sentences were labeled in three classes (positive, neutral and negative) by seven annotators, following literature guidelines for ensuring the reliability of the annotation. We also ran baseline experiments on polarity classification using three machine learning methods, reaching 80.99% F-Measure and 82.06% accuracy in binary classification, and 59.85% F-Measure and 64.62% accuracy in three-class classification.
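The abstract reports both accuracy and F-Measure for the three-class task. As a minimal sketch of how those two metrics differ, the snippet below computes them on a tiny hand-made example. The toy labels are hypothetical and are not drawn from TweetSentBR, and macro-averaging of the per-class F1 is an assumption: the abstract does not state which averaging scheme the authors used.

```python
# Label set taken from the abstract (positive, neutral, negative).
LABELS = ("positive", "neutral", "negative")


def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, for illustration only.
gold = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]

print(f"accuracy = {accuracy(gold, pred):.4f}")
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

Accuracy counts every sentence equally, while macro F1 weights every class equally, so the two can diverge when the classes are imbalanced or when one class is much harder than the others.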

Code Implementations: 2 repos
