CL · Dec 24, 2017

Building a Sentiment Corpus of Tweets in Brazilian Portuguese

Henrico Bertini Brum, Maria das Graças Volpe Nunes

arXiv:1712.08917v1 39.31094 citations

Originality: Synthesis-oriented

AI Analysis

This work provides a new dataset for researchers working on sentiment analysis in Brazilian Portuguese, but the contribution is incremental: it applies existing methods to a new language and domain.

The paper addresses the lack of sentiment analysis datasets for Brazilian Portuguese by introducing TweetSentBR, a manually annotated corpus of 15,000 tweets in the TV show domain. Baseline classifiers reach accuracies of up to 82.06% on the binary task and 64.62% on the three-class task.

The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area, due to its subjective and semantic characteristics, motivates research on novel methods and approaches for classification. Hence, there is high demand for datasets in different domains and languages. This paper introduces TweetSentBR, a manually annotated sentiment corpus for Brazilian Portuguese with 15,000 sentences in the TV show domain. The sentences were labeled with three classes (positive, neutral and negative) by seven annotators, following guidelines from the literature to ensure the reliability of the annotation. We also ran baseline experiments on polarity classification using three machine learning methods, reaching an F-measure of 80.99% and an accuracy of 82.06% in binary classification, and an F-measure of 59.85% and an accuracy of 64.62% in three-class classification.
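The abstract reports both accuracy and F-measure, and the two can diverge noticeably on a three-class task. The sketch below shows one common way to compute them for positive/neutral/negative labels. It assumes a macro-averaged F-measure (the abstract does not say which averaging was used), and the label strings and toy data are purely illustrative, not taken from the corpus.

```python
from collections import Counter

# Assumed label names, for illustration only.
LABELS = ("positive", "neutral", "negative")


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, not drawn from TweetSentBR.
gold = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]

print(f"accuracy = {accuracy(gold, pred):.4f}")
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

Because macro averaging weights every class equally, a classifier that does poorly on a less frequent class can show an F-measure well below its accuracy, which is one possible reading of the gap between the two three-class figures above.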
