CL · Jun 1, 2020

Lexical Normalization for Code-switched Data and its Effect on POS-tagging

arXiv:2006.01175v20.816 citations

Originality: Incremental advance

AI Analysis

The paper addresses code-switching in social media NLP for language pairs such as Indonesian-English and Turkish-German. The advance is incremental because it builds on existing normalization methods.

The paper tackles lexical normalization for code-switched social media data, proposing three models for Indonesian-English and Turkish-German language pairs. Results show these models outperform existing approaches and yield a 5.4% relative performance increase for POS tagging compared to unnormalized input.

Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet the use of multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite being common on social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.
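To make two terms from the abstract concrete, here is a minimal Python sketch. It is not one of the paper's three models: `normalize_tokens` is a hypothetical word-level lookup normalizer with a toy English dictionary, and `relative_increase` simply shows how a relative (as opposed to absolute) performance gain is computed. The accuracy figures in the usage lines are made-up illustration values, not results from the paper.

```python
def normalize_tokens(tokens, lookup):
    """Replace each non-canonical token with its canonical form, if known.

    `lookup` is a hypothetical dictionary mapping noisy forms to standard
    forms; tokens that are not in it are passed through unchanged.
    """
    return [lookup.get(tok.lower(), tok) for tok in tokens]


def relative_increase(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Toy dictionary (assumption for illustration only).
toy_lookup = {"c": "see", "u": "you", "tmrw": "tomorrow"}

normalized = normalize_tokens(["c", "u", "tmrw"], toy_lookup)
print(normalized)

# Made-up scores: moving from 60.0 to 63.0 is 3 points absolute,
# which corresponds to a relative increase of about 5 percent.
gain = relative_increase(60.0, 63.0)
print(round(gain, 2))
```

A dictionary lookup like this is language-agnostic, which is exactly why it falls short on code-switched text: the right canonical form for a token can depend on which language the token belongs to, which is the gap the CS-tailored models in the paper aim to close.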
