Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
This survey addresses the problem of enabling effective NLP for multilingual societies. Its contribution is incremental, since it synthesizes existing research rather than introducing new methods.
This survey tackles the challenge of code-switching in multilingual NLP, analyzing 308 studies to show that, despite recent advances, large language models still struggle with mixed-language inputs and remain constrained by limited datasets and evaluation biases.
Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, and progress is further constrained by limited CSW datasets and evaluation biases, all of which hinder deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 308 studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.