CL · Jul 27, 2023

Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training

Mohammad Majd Saad Al Deen, Maren Pielka, Jörn Hees, Bouthaina Soulef Abdou, Rafet Sifa

arXiv:2307.14666v1 · 0.51 citations · h-index: 28 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the resource-poor nature of Arabic NLP by providing a first large-scale evaluation of Natural Language Inference and Contradiction Detection in Arabic, though it is incremental in that it applies existing methods to a new language context.

The paper tackles the problem of limited NLP resources for Arabic by creating a new dataset and applying transformer models with linguistically informed pre-training. It finds that AraBERT with Named Entity Recognition pre-training performs competitively with state-of-the-art multilingual approaches.

This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that few data sets are available, which in turn limits the availability of NLP methods. To overcome this limitation, we create a dedicated data set from publicly available resources. Subsequently, transformer-based machine learning models are trained and evaluated. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.

