CL · May 16, 2021

The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus

arXiv:2105.07400v3712 citations
Originality: Incremental advance
AI Analysis

This paper addresses a less-studied problem in NLP that matters to researchers working with low-resource languages such as Algerian dialect, though the advance is incremental because it builds on existing multilingual models.

The study investigated how language similarity and script differences affect cross-lingual transfer for part-of-speech tagging and sentiment analysis, finding a delicate relationship for part-of-speech tagging while sentiment analysis was less sensitive.

Recent years have seen a rise in interest in cross-lingual transfer between languages with similar typology, and between languages written in different scripts. However, the interplay between language similarity and difference in script in cross-lingual transfer is a less-studied problem. We explore this interplay for two supervised tasks, namely part-of-speech tagging and sentiment analysis. We introduce a newly annotated corpus of Algerian user-generated comments comprising parallel annotations of Algerian written in Latin, Arabic, and code-switched scripts, as well as annotations for sentiment and topic categories. We perform baseline experiments by fine-tuning multi-lingual language models. We further explore the effect of script vs. language similarity in cross-lingual transfer by fine-tuning multi-lingual models on languages which a) are typologically distinct but use the same script, b) are typologically similar but use a distinct script, or c) are typologically similar and use the same script. We find there is a delicate relationship between script and typology for part-of-speech tagging, while sentiment analysis is less sensitive.

Code Implementations: 1 repo
