Code-Mix Sentiment Analysis on Hinglish Tweets
This work addresses the challenge of brand sentiment tracking in India, which matters to both marketers and NLP researchers, but its contribution is incremental: it adapts an existing method to a specific domain.
The paper tackles inaccurate sentiment analysis of Hinglish (Hindi-English code-mixed) tweets, a problem that hinders brand monitoring in India. It fine-tunes mBERT with subword tokenization, and the authors present the result as a production-ready solution and a benchmark for multilingual NLP in low-resource settings.
Brand monitoring in India is increasingly challenged by the rise of Hinglish, a hybrid of Hindi and English that is widely used in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, which leads to inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a sentiment classification framework designed specifically for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better capture the linguistic diversity of Indian social media. A key component of our methodology is subword tokenization, which lets the model handle the spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish by splitting unfamiliar words into smaller known pieces. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
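To illustrate why subword tokenization helps with Romanized spelling variants, the following is a minimal sketch of greedy longest-match-first splitting, the scheme used by WordPiece tokenizers such as mBERT's. It is not the authors' pipeline: the function names and the toy vocabulary are assumptions made up for this example, and the real mBERT vocabulary is far larger and would split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for illustration only.
TOY_VOCAB = {"bah", "boh", "##ut", "##ot", "ac", "##cha", "##ha", "phone", "hai"}


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and tokenize each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


if __name__ == "__main__":
    # Two spellings of the same Hinglish phrase ("very good").
    print(tokenize("bahut accha phone hai"))
    print(tokenize("bohot acha"))
```

Under this toy vocabulary, both "accha" and "acha" begin with the shared piece "ac", so the variants stay related at the token level instead of collapsing to a single unknown token. A word with no matching pieces still falls back to "[UNK]".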