Similarities between Arabic Dialects: Investigating Geographical Proximity
This work addresses the problem of Arabic dialect classification for linguists and NLP researchers by highlighting the importance of geographical proximity over country borders, though it is incremental as it builds on existing datasets and methods.
This paper investigates how geographical proximity affects dialectical similarity between cities in Arab countries, using cosine similarity and geographical distance measurements on the MADAR and NADI datasets. The results show that cities in different countries can be more similar than cities within the same country if they are geographically closer, suggesting a need for more granular dialect classification.
The automatic classification of Arabic dialects is an ongoing research challenge. Recent work has explored it by defining dialects over increasingly small geographic areas, such as cities and provinces. This paper focuses on a related yet relatively unexplored topic: the effect of the geographical proximity of cities located in Arab countries on their dialectical similarity. Our approach has two parts: 1) comparing the textual similarity between dialects using cosine similarity and 2) measuring the geographical distance between locations. We study MADAR and NADI, two established datasets with Arabic dialects from many cities and provinces. Our results indicate that cities located in different countries may in fact have more dialectical similarity than cities within the same country, depending on their geographical proximity. The correlation between dialectical similarity and city proximity suggests that cities that are closer together are more likely to share dialectical attributes, regardless of country borders. This nuance opens the way to important advances in Arabic dialect research, because it indicates that a more granular approach to dialect classification is essential to framing the problem of Arabic dialect identification.
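The two measurements described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how texts are vectorized or how distance is computed, so whitespace-token counts and the haversine great-circle formula are assumptions made here. The city names, coordinates, and tokens are hypothetical placeholders, not data from MADAR or NADI.

```python
import math
from collections import Counter
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-token count vectors (assumed featurization)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (assumed distance measure)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical placeholders: (latitude, longitude) and a toy token string per city.
cities = {
    "city_a": ((30.0, 31.0), "w1 w2 w3 w4"),
    "city_b": ((31.0, 32.0), "w1 w2 w3 w5"),
    "city_c": ((34.0, 10.0), "w1 w6 w7 w8"),
}

# Compare every pair of cities on both measures.
for (name_1, (loc_1, text_1)), (name_2, (loc_2, text_2)) in combinations(cities.items(), 2):
    sim = cosine_similarity(text_1, text_2)
    dist = haversine_km(*loc_1, *loc_2)
    print(f"{name_1} vs {name_2}: similarity={sim:.2f}, distance={dist:.0f} km")
```

On real data, each city's text would be its portion of the corpus. The resulting pairs of similarity and distance values could then be examined for correlation, for instance with `statistics.correlation` in Python 3.10 or later.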