CL, Aug 30, 2019

CodeSwitch-Reddit: Exploration of Written Multilingual Discourse in Online Discussion Forums

arXiv:1908.11841v1996 citations
AI Analysis

The paper addresses a gap in sociolinguistic research on written code-switching by providing a new dataset for linguists and computational researchers, though its contribution is incremental: it applies questions already studied for oral code-switching to written data.

The researchers tackled the problem of limited resources for studying written code-switching by introducing a novel, large, and diverse dataset curated from bilingual Reddit communities, and they explored whether findings from oral code-switching apply to written forms.

In contrast to many decades of research on oral code-switching, the study of written multilingual productions has only recently enjoyed a surge of interest. Many open questions remain regarding the sociolinguistic underpinnings of written code-switching, and progress has been limited by a lack of suitable resources. We introduce a novel, large, and diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far. We investigate whether findings in oral code-switching concerning content and style, as well as speaker proficiency, are carried over into written code-switching in discussion forums. The released dataset can further facilitate a range of research and practical activities.

Code Implementations: 1 repo
