Linear Semantic Segmentation for Low-Resource Spoken Dialects
For researchers working on discourse analysis of low-resource spoken dialects, this work provides a benchmark and a method that address the gap in semantic segmentation for informal, code-switched speech.
The paper introduces a new multi-genre benchmark for semantic segmentation in dialectal Arabic and shows that existing models degrade on dialectal speech. Its proposed model consistently outperforms strong baselines on dialectal non-news genres.
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
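To make the idea of segmenting by local semantic coherence concrete, the following is a minimal sketch of a generic lexical-cohesion segmenter in the spirit of TextTiling. It is not the model proposed in the paper, whose architecture the abstract does not describe. The function names `cosine` and `segment`, the `window` and `threshold` parameters, and the whitespace tokenization are all assumptions made for illustration; real dialectal Arabic transcripts would need proper tokenization and, most likely, learned sentence representations rather than word counts.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(utterances: list[str], window: int = 2, threshold: float = 0.2) -> list[int]:
    """Return each index i such that a boundary is placed before utterances[i].

    A boundary is placed wherever the lexical similarity between the `window`
    utterances before position i and the `window` utterances from i onward
    falls below `threshold` (hypothetical defaults, not tuned values).
    """
    # Naive whitespace tokenization: an assumption for illustration only.
    bags = [Counter(u.lower().split()) for u in utterances]
    boundaries = []
    for i in range(1, len(bags)):
        left = sum(bags[max(0, i - window):i], Counter())
        right = sum(bags[i:i + window], Counter())
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


# Toy English dialogue with one topic shift (sports, then prices).
dialogue = [
    "the match was great",
    "great match last night",
    "the price of bread",
    "bread price went up",
]
print(segment(dialogue))  # [2]
```

A purely lexical criterion like this one is exactly what informal syntax, code-switching, and weakly marked discourse structure tend to break, which is why it serves only as a point of reference for what "local coherence" means operationally.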