CL Jun 26, 2024

Implicit Discourse Relation Classification For Nigerian Pidgin

Muhammed Saeed, Peter Bourgonje, Vera Demberg

arXiv:2406.18776v213.522 citations

Originality: Incremental advance

AI Analysis

This work addresses the NLP performance gap for under-resourced languages such as Nigerian Pidgin, which is spoken by nearly 100 million people, by developing a more effective classification method. The advance is incremental, since it builds on existing translation and corpus techniques.

The paper tackles implicit discourse relation classification for under-resourced Nigerian Pidgin by comparing a translation-based approach with a synthetic-corpus approach. The resulting native classifier outperforms the baseline by 13.27% and 33.98% in F1 score for 4-way and 11-way classification, respectively.

Despite attempts to make Large Language Models multilingual, many of the world's languages are still severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-financed languages and those aimed at less-resourced languages. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people but has comparatively few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare two approaches. The first translates NP data to English, applies a well-resourced IDRC tool, and back-projects the labels. The second creates a synthetic discourse corpus for NP, in which we translate PDTB and project PDTB labels, and then trains an NP IDR classifier. The latter approach of learning a "native" NP classifier outperforms our baseline by 13.27% and 33.98% in F1 score for 4-way and 11-way classification, respectively.
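The two strategies compared in the abstract can be contrasted in a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the representation of an instance as a plain (Arg1, Arg2) string pair, and the simplification that label projection happens at the instance level are all hypothetical. Translators and classifiers are passed in as callables so that any real system could be substituted.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: an instance is a pair of argument strings (Arg1, Arg2).
ArgPair = Tuple[str, str]
Translator = Callable[[str], str]
Classifier = Callable[[ArgPair], str]


def translate_pair(pair: ArgPair, translate: Translator) -> ArgPair:
    """Translate both arguments of an instance with the given translator."""
    return (translate(pair[0]), translate(pair[1]))


def classify_via_english(
    np_pairs: List[ArgPair],
    np_to_en: Translator,
    en_classifier: Classifier,
) -> List[str]:
    """Approach 1 (assumed shape): translate NP to English, classify with an
    English tool, and carry each predicted label back to its NP instance."""
    labels = []
    for pair in np_pairs:
        en_pair = translate_pair(pair, np_to_en)
        # The label at index i belongs to the original NP pair at index i.
        labels.append(en_classifier(en_pair))
    return labels


def build_synthetic_np_corpus(
    en_corpus: List[Tuple[ArgPair, str]],
    en_to_np: Translator,
) -> List[Tuple[ArgPair, str]]:
    """Approach 2, first step (assumed shape): translate a labelled English
    corpus into NP and keep each gold label with its translated instance.
    A native NP classifier would then be trained on the returned corpus."""
    return [(translate_pair(pair, en_to_np), label) for pair, label in en_corpus]
```

In this sketch the difference between the approaches is where translation happens: the first translates at prediction time and reuses an English model, while the second translates the training data once and fits a model directly on NP text.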

