CL | Mar 30, 2022

Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

Elena Álvarez-Mellado, Constantine Lignos

arXiv:2203.16169v131.9638 citations | h-index: 13 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the identification of unassimilated borrowings in Spanish text. Its contribution is incremental: it evaluates existing sequence labeling methods on a new annotated corpus.

The paper tackles the problem of detecting unassimilated lexical borrowings in Spanish by introducing a new annotated corpus of 370,000 tokens and evaluating sequence labeling models. It finds that a BiLSTM-CRF model combining subword embeddings with Transformer-based or contextualized word embeddings outperforms a multilingual BERT-based model.

This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings -- words from one language that are introduced into another without orthographic adaptation -- and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings along with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings outperforms results obtained by a multilingual BERT-based model.
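As a rough illustration of what treating borrowing detection as sequence labeling can look like, the sketch below decodes per-token BIO tags into borrowing spans. The tag names (`B-ENG`, `I-ENG`, `O`), the `bio_to_spans` helper, and the example sentence are assumptions made for illustration; the summary above does not specify the corpus's label scheme.

```python
from typing import List, Tuple

# (start index, end index exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into labeled spans.

    Hypothetical helper: the BIO scheme and label names are assumptions,
    not taken from the paper summary.
    """
    spans: List[Span] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # A stray I- tag is treated leniently as the start of a new span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Invented example sentence with two unadapted English borrowings.
tokens = ["El", "nuevo", "look", "incluye", "un", "animal", "print", "."]
tags = ["O", "O", "B-ENG", "O", "O", "B-ENG", "I-ENG", "O"]

for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

A sequence labeling model such as a CRF or BiLSTM-CRF would predict the `tags` list; decoding it into spans like this is what allows multiword borrowings to be treated as single units.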

