CL, Sep 28, 2019

Part of speech tagging for code switched data

arXiv:1909.13006v21101 citations
AI Analysis

This paper addresses the challenge of processing intra-sentential code-switching in NLP. The contribution is incremental: it applies existing methods to a specific linguistic problem.

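The paper tackles part-of-speech tagging for code-switched data in two language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects. It compares strategies such as combining two monolingual taggers versus training a unified tagger on code-switched data, and finds that a machine learning framework built on two state-of-the-art POS taggers performs best among the approaches investigated.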

We address the problem of part-of-speech (POS) tagging in the context of linguistic code switching (CS). CS is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential CS, respectively. Processing CS data is especially challenging for intra-sentential data, since state-of-the-art monolingual NLP technology is geared toward processing one language at a time. In this paper we explore multiple strategies for applying state-of-the-art POS taggers to CS data. We investigate the landscape in two CS language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects. We compare the use of two POS taggers vs. a unified tagger trained on CS data. Our results show that applying a machine learning framework using two state-of-the-art POS taggers achieves better performance than all other approaches that we investigate.

