From Standard English to Singlish: A Retrieval-Augmented Approach for Code-Switched Creole Generation in Large Language Models
For NLP practitioners working on low-resource code-switched varieties, this work offers a practical, auditable alternative to fine-tuning. Its contribution is incremental, however, since it applies existing RAG techniques to a new domain.
The paper tackles code-switched generation for Singaporean English (Singlish) using a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon. In a human evaluation with 164 Singaporean participants, RAG and zero-shot outputs were rated equally natural and appropriate. Automatic analyses show that RAG makes minimal substitutions (median 1 token edit vs. 23 for zero-shot prompting) and preserves meaning better (mean cosine similarity 0.978 vs. 0.926).
Code-switching in contact varieties like Singaporean English (Singlish) challenges natural language generation due to limited parallel data and rapid lexical evolution. We propose a retrieval-augmented generation (RAG) framework that externalizes dialectal knowledge into a curated lexicon, enabling controlled lexical code-switching without fine-tuning. Our approach retrieves candidate Singlish expressions and guides generation through sparse lexical substitution. Human evaluation with 164 Singaporean participants found RAG and zero-shot prompting equally natural and appropriate. Automatic analyses reveal different transformation regimes: zero-shot prompting induces extensive paraphrasing (median 23 token edits), whereas RAG performs minimal substitutions (median 1 edit) with higher semantic preservation (mean cosine similarity 0.978 vs. 0.926). Our results demonstrate that externalizing code-switching into lexical resources enables control and auditability without sacrificing perceived quality, offering practical advantages for rapidly evolving contact varieties.
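To make the retrieve-then-substitute idea and the token-edit measure concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the abstract describes an LLM guided by retrieved candidates, whereas this sketch swaps in a deterministic dictionary lookup so the control flow is easy to follow. The two-entry lexicon, the function names (`retrieve`, `substitute`, `token_edits`), the `max_subs` parameter, and the choice of token-level Levenshtein distance as the edit count are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (Standard English word -> Singlish candidate).
# The paper's curated lexicon is not reproduced here.
LEXICON: Dict[str, str] = {
    "eat": "makan",
    "delicious": "shiok",
}


def retrieve(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (position, candidate) pairs for tokens that have a lexicon entry."""
    return [
        (i, lexicon[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in lexicon
    ]


def substitute(sentence: str, lexicon: Dict[str, str], max_subs: int = 1) -> str:
    """Sparse substitution: replace at most `max_subs` tokens, leave the rest intact."""
    tokens = sentence.split()
    for position, candidate in retrieve(tokens, lexicon)[:max_subs]:
        tokens[position] = candidate
    return " ".join(tokens)


def token_edits(source: str, target: str) -> int:
    """Token-level Levenshtein distance between two whitespace-tokenized strings."""
    src, tgt = source.split(), target.split()
    prev = list(range(len(tgt) + 1))
    for i, s_tok in enumerate(src, start=1):
        curr = [i]
        for j, t_tok in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete a source token
                curr[j - 1] + 1,                   # insert a target token
                prev[j - 1] + (s_tok != t_tok),    # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


source = "Let us eat something delicious tonight"
output = substitute(source, LEXICON, max_subs=1)
print(output)
print(token_edits(source, output))
```

Because every change traces back to a specific lexicon entry, the output can be audited by inspecting which entries fired, and the lexicon can be updated as vocabulary shifts without retraining anything. Capping substitutions with `max_subs` is one simple way to keep the edit count low, in the spirit of the minimal-substitution regime reported above.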