CL, May 30, 2018

Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition

Genta Indra Winata, Chien-Sheng Wu, Andrea Madotto, Pascale Fung

arXiv:1805.12061v232.01099 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of processing noisy, multilingual social media text for NLP applications, but the advance is incremental because it builds on existing methods and targets a specific domain.

The authors tackled the problem of named entity recognition in code-switching Twitter data by proposing an LSTM-based model with bilingual character representation and transfer learning to handle out-of-vocabulary words, achieving a 62.76% harmonic mean F1-score in a shared task.

We propose an LSTM-based model with hierarchical architecture on named entity recognition from code-switching Twitter data. Our model uses bilingual character representation and transfer learning to address out-of-vocabulary words. In order to mitigate data noise, we propose to use token replacement and normalization. In the 3rd Workshop on Computational Approaches to Linguistic Code-Switching Shared Task, we achieved second place with 62.76% harmonic mean F1-score for English-Spanish language pair without using any gazetteer and knowledge-based information.
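As a rough illustration of what token replacement and normalization can look like for noisy Twitter text, here is a minimal sketch. It is not the authors' implementation: the placeholder tokens `<url>` and `<user>`, the rule that collapses repeated characters, and the function names `normalize_token` and `normalize_tweet` are all assumptions chosen for this example.

```python
import re

# Hypothetical patterns; the abstract does not specify the actual rules used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times


def normalize_token(token: str) -> str:
    """Replace or normalize one token (assumed rules, for illustration only)."""
    if URL_RE.fullmatch(token):
        return "<url>"   # token replacement: URLs share one placeholder
    if MENTION_RE.fullmatch(token):
        return "<user>"  # token replacement: user mentions share one placeholder
    hashtag = HASHTAG_RE.fullmatch(token)
    if hashtag:
        token = hashtag.group(1)  # normalization: drop the leading '#'
    # Normalization: collapse elongated spellings such as "soooo" to "soo".
    return REPEAT_RE.sub(r"\1\1", token)


def normalize_tweet(tokens):
    """Apply normalize_token to every token of an already tokenized tweet."""
    return [normalize_token(t) for t in tokens]


if __name__ == "__main__":
    tweet = ["@maria", "Vamos", "a", "Miami", "soooo", "#happy", "https://t.co/abc"]
    print(normalize_tweet(tweet))
```

Casing is left untouched in this sketch, since capitalization is often a useful cue for named entities. Mapping rare surface forms such as URLs and user names to shared placeholders is one simple way to reduce the number of out-of-vocabulary tokens a model sees.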
