Enhancing Multilingual Language Models for Code-Switched Input Data
This work addresses the handling of code-switched data in NLP applications for globalized contexts. It is an incremental improvement: an existing model is adapted to a specific dataset.
The research tackles code-switching in multilingual language models by pre-training Multilingual BERT on a dataset of Spanglish tweets. The resulting model matches or outperforms the baseline on part-of-speech tagging, sentiment analysis, named entity recognition, and language identification, with the largest gains in part-of-speech tagging.
Code-switching, or alternating between languages within a single conversation, presents challenges for multilingual language models on NLP tasks. This research investigates whether pre-training Multilingual BERT (mBERT) on code-switched datasets improves the model's performance on critical NLP tasks such as part-of-speech tagging, sentiment analysis, named entity recognition, and language identification. We use a dataset of Spanglish tweets for pre-training and evaluate the pre-trained model against a baseline model. Our findings show that our pre-trained mBERT model outperforms or matches the baseline model on these tasks, with the most significant improvements seen for part-of-speech tagging. Additionally, our latent analysis uncovers more homogeneous English and Spanish embeddings for the language identification task, providing insights for future modeling work. This research highlights the potential of adapting multilingual LMs to code-switched input data, making them more useful in globalized and multilingual contexts. Future work includes extending the experiments to other language pairs, incorporating multiform data, and exploring methods for better understanding context-dependent code-switches.
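
To make the notion of a code-switch concrete, the minimal sketch below locates switch points in a sentence whose tokens already carry language tags. It is an illustration only and not the method used in this work: the function name `switch_points`, the tag set (`en`, `es`, `other`), and the example sentence are all assumptions introduced here.

```python
from typing import List, Tuple


def switch_points(tags: List[str], ignore: Tuple[str, ...] = ("other",)) -> List[int]:
    """Return the token indices at which the language changes.

    Tokens whose tag is in `ignore` (e.g. punctuation) are skipped, so they
    neither trigger nor hide a switch. The tag names are hypothetical.
    """
    points = []
    prev = None  # language tag of the last non-ignored token
    for i, tag in enumerate(tags):
        if tag in ignore:
            continue
        if prev is not None and tag != prev:
            points.append(i)
        prev = tag
    return points


# Hypothetical Spanglish example with hand-assigned token-level tags.
tokens = ["I", "need", "to", "buy", "leche", "y", "pan", "today", "!"]
tags = ["en", "en", "en", "en", "es", "es", "es", "en", "other"]

print(switch_points(tags))  # [4, 7]
```

Here the sentence switches to Spanish at "leche" and back to English at "today". A token-level language identification model would have to predict such tags itself, which is one reason the task is harder on code-switched text than on monolingual text.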