Sentiment Analysis on Brazilian Portuguese User Reviews
This is an incremental contribution that addresses a domain-specific problem for researchers and practitioners working with Brazilian Portuguese sentiment analysis by providing standardized datasets and benchmarks.
This work tackled the lack of linguistic resources for sentiment analysis in Brazilian Portuguese by creating a unified dataset with predefined partitions and evaluating document embedding strategies, finding that dataset-specific models generally outperformed a single model across different contexts.
Sentiment analysis is one of the most classical and widely studied natural language processing tasks. The field has advanced notably with the introduction of more complex and scalable machine learning models. Despite this progress, Brazilian Portuguese still has only limited linguistic resources, such as datasets dedicated to sentiment classification, particularly ones with predefined partitions into training, testing, and validation sets that would allow a fairer comparison of alternative algorithms. Motivated by these issues, this work analyzes the predictive performance of a range of document embedding strategies, taking polarity as the system outcome. The analysis covers five sentiment analysis datasets in Brazilian Portuguese, unified into a single dataset, together with a reference partitioning into training, testing, and validation sets, both made publicly available through a digital repository. A cross-evaluation of dataset-specific models across different contexts is conducted to assess their generalization capabilities and the feasibility of adopting a single model for all scenarios.
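The cross-evaluation protocol described above can be illustrated with a minimal sketch: one model is trained per dataset on its training split and then scored on the test split of every dataset, including its own. Everything in the sketch is an assumption made for illustration and is not taken from the work itself. The bag-of-words `embed` function stands in for a real document embedding, the nearest-centroid classifier stands in for the evaluated models, and the two toy datasets (`products`, `movies`) with their example reviews are hypothetical.

```python
from collections import Counter
import math


def embed(text):
    """Toy document embedding: bag-of-words term counts (stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Build one centroid per polarity label by summing its documents' embeddings."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(embed(text))
    return centroids


def predict(centroids, text):
    """Return the label whose centroid is most similar to the document."""
    doc = embed(text)
    # Ties (e.g. no vocabulary overlap) fall back to the smallest label.
    return max(sorted(centroids), key=lambda label: cosine(centroids[label], doc))


def accuracy(centroids, examples):
    hits = sum(predict(centroids, text) == label for text, label in examples)
    return hits / len(examples)


def cross_evaluate(datasets):
    """Train on each dataset's train split; score on every dataset's test split."""
    models = {name: train(splits["train"]) for name, splits in datasets.items()}
    return {
        (source, target): accuracy(model, datasets[target]["test"])
        for source, model in models.items()
        for target in datasets
    }


# Hypothetical toy data with fixed partitions (1 = positive, 0 = negative).
datasets = {
    "products": {
        "train": [("produto ótimo recomendo", 1), ("produto péssimo quebrou", 0)],
        "test": [("ótimo produto", 1), ("quebrou rápido péssimo", 0)],
    },
    "movies": {
        "train": [("filme excelente adorei", 1), ("filme chato detestei", 0)],
        "test": [("adorei filme excelente", 1), ("chato detestei", 0)],
    },
}

scores = cross_evaluate(datasets)
for (source, target), acc in sorted(scores.items()):
    print(f"train={source:<8} test={target:<8} accuracy={acc:.2f}")
```

In the resulting score matrix, the diagonal entries give in-domain performance and the off-diagonal entries show how well a dataset-specific model transfers to another context. Keeping the partitions fixed, as in the toy data here, is what makes scores comparable across models.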