Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data
The work addresses the scarcity of annotated data for code-switched languages in multilingual communities, though its contribution is incremental because it builds on existing pre-trained models and unsupervised techniques.
The paper tackles sentiment analysis for code-switched social media data by proposing an unsupervised self-training framework that uses pre-trained BERT models and pseudo labels from zero-shot transfer, achieving performance within 1-7% of supervised models in weighted F1 scores.
Sentiment analysis is an important task in understanding social media content such as customer reviews and Twitter and Facebook feeds. In multilingual communities around the world, a large amount of social media text is characterized by the presence of code-switching. It has therefore become important to build models that can handle code-switched data. However, annotated code-switched data is scarce, so there is a need for unsupervised models and algorithms. We propose a general framework called Unsupervised Self-Training and show its application to the specific use case of sentiment analysis of code-switched data. We initialize with pre-trained BERT models and fine-tune them in an unsupervised manner, using only pseudo labels produced by zero-shot transfer. We test our algorithm on multiple code-switched languages and provide a detailed analysis of the learning dynamics of the algorithm, with the aim of answering the question: "Does our unsupervised model understand the code-switched languages or does it just learn its representations?" Our unsupervised models compete well with their supervised counterparts, with their performance reaching within 1-7% (weighted F1 scores) of supervised models trained for a two-class problem.
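The self-training idea described in the abstract (initialize from a pre-trained model, pseudo-label unlabeled data, then fine-tune on those pseudo labels) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say how pseudo labels are selected, so the confidence threshold below is an assumption, and `TinyClassifier`, `self_train`, the starting weights, and the data are all hypothetical. A one-dimensional logistic regression stands in for BERT, and its starting weights play the role of the zero-shot model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyClassifier:
    """Hypothetical stand-in for a pre-trained model: 1-D logistic regression."""

    def __init__(self, w, b):
        self.w, self.b = w, b

    def predict_proba(self, x):
        # Probability of the positive class.
        return sigmoid(self.w * x + self.b)

    def fit(self, xs, ys, lr=0.5, epochs=200):
        # Batch gradient descent on the log loss, continuing from current weights.
        n = len(xs)
        for _ in range(epochs):
            errors = [self.predict_proba(x) - y for x, y in zip(xs, ys)]
            grad_w = sum(e * x for e, x in zip(errors, xs)) / n
            grad_b = sum(errors) / n
            self.w -= lr * grad_w
            self.b -= lr * grad_b


def self_train(model, unlabeled, rounds=5, threshold=0.8):
    """Repeatedly pseudo-label confident examples and fine-tune on them.

    The threshold-based selection is an assumption made for this sketch.
    Returns the model and the number of pseudo-labeled examples per round.
    """
    history = []
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:
            p = model.predict_proba(x)
            if max(p, 1.0 - p) >= threshold:  # keep confident predictions only
                pseudo.append((x, 1 if p >= 0.5 else 0))
        history.append(len(pseudo))
        if not pseudo:
            break
        xs, ys = zip(*pseudo)
        model.fit(xs, ys)
    return model, history


# Toy unlabeled data; the true labels are used only for evaluation.
unlabeled = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
true_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# The starting weights are an assumed, slightly biased "zero-shot" model.
model = TinyClassifier(w=1.0, b=0.5)
model, history = self_train(model, unlabeled)

preds = [1 if model.predict_proba(x) >= 0.5 else 0 for x in unlabeled]
accuracy = sum(p == t for p, t in zip(preds, true_labels)) / len(true_labels)
print("pseudo-labeled per round:", history)
print("accuracy:", accuracy)
```

No gold labels enter the training loop; they appear only in the final accuracy check. In a realistic setting the classifier would be a fine-tuned transformer and the selection rule would need tuning, since a poorly calibrated starting model can reinforce its own mistakes.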