CL · Jun 10, 2022

Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing

Elena Alvarez Mellado, Constantine Lignos

arXiv:2206.04973

Originality: Synthesis-oriented

AI Analysis

The corpus is a resource for linguists and NLP researchers studying language mixing, but the contribution is incremental: it builds on prior codeswitching corpora by adding a finer-grained distinction between codeswitching and borrowing.

The authors tackled the problem of distinguishing between codeswitching and borrowing in language mixing by creating a new corpus of 9,500 Spanish-English tweets annotated at the token level, enabling study and modeling of these phenomena on Twitter.

We present a new corpus of Twitter data annotated for codeswitching and borrowing between Spanish and English. The corpus contains 9,500 tweets annotated at the token level with codeswitches, borrowings, and named entities. This corpus differs from prior corpora of codeswitching in that we attempt to clearly define and annotate the boundary between codeswitching and borrowing and do not treat common "internet-speak" ('lol', etc.) as codeswitching when used in an otherwise monolingual context. The result is a corpus that enables the study and modeling of Spanish-English borrowing and codeswitching on Twitter in one dataset. We present baseline scores for modeling the labels of this corpus using Transformer-based language models. The annotation itself is released with a CC BY 4.0 license, while the text it applies to is distributed in compliance with the Twitter terms of service.
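The token-level annotation described in the abstract can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the label names (`O`, `CS`, `BOR`, `NE`), the `Token` class, the `labeled_spans` helper, and both example tweets are hypothetical and are not taken from the released corpus, whose actual tag set and file format may differ.

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical tag set: O = unlabeled, CS = codeswitch,
# BOR = borrowing, NE = named entity. The real corpus may use other names.
LABELS = {"O", "CS", "BOR", "NE"}


@dataclass(frozen=True)
class Token:
    text: str
    label: str  # expected to be one of LABELS


def labeled_spans(tokens):
    """Group consecutive tokens that share a non-"O" label into (label, text) spans.

    Note: adjacent spans with the same label are merged; a BIO-style
    scheme would be needed to keep them apart.
    """
    spans = []
    for label, group in groupby(tokens, key=lambda t: t.label):
        if label != "O":
            spans.append((label, " ".join(t.text for t in group)))
    return spans


# Invented tweet mixing a borrowing, a named entity, and a codeswitch.
mixed = [
    Token("Me", "O"), Token("encanta", "O"), Token("el", "O"),
    Token("marketing", "BOR"), Token("de", "O"), Token("Netflix", "NE"),
    Token("but", "CS"), Token("it's", "CS"), Token("too", "CS"), Token("much", "CS"),
]

# Invented monolingual tweet: "lol" is left unlabeled, mirroring the
# abstract's treatment of internet-speak in an otherwise monolingual context.
monolingual = [
    Token("jajaja", "O"), Token("no", "O"), Token("puedo", "O"), Token("lol", "O"),
]

mixed_spans = labeled_spans(mixed)
mono_spans = labeled_spans(monolingual)
print(mixed_spans)
print(mono_spans)
```

Keeping one label per token, rather than one label per tweet, is what lets a single tweet carry a borrowing, a named entity, and a codeswitch at the same time.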
