Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English
This work addresses sentiment analysis for multilingual social media users, but it is incremental as it applies existing methods to a specific dataset with minor adaptations.
The paper tackled sentiment polarity detection in YouTube comments with code-switching between Tamil, Malayalam, and English, achieving a weighted average F1 score of 0.77 for Tamil-English after the task deadline, which betters the top-ranked classifier by a wide margin.
Theedhum Nandrum is a sentiment polarity detection system using two approaches--a Stochastic Gradient Descent (SGD) based classifier and a Long Short-term Memory (LSTM) based Classifier. Our approach utilises language features like use of emoji, choice of scripts and code mixing which appeared quite marked in the datasets specified for the Dravidian Codemix - FIRE 2020 task. The hyperparameters for the SGD were tuned using GridSearchCV. Our system was ranked 4th in Tamil-English with a weighted average F1 score of 0.62 and 9th in Malayalam-English with a score of 0.65. We achieved a weighted average F1 score of 0.77 for Tamil-English using a Logistic Regression based model after the task deadline. This performance betters the top ranked classifier on this dataset by a wide margin. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex. Our complete code is published in github at https://github.com/oligoglot/theedhum-nandrum.