Context-Gloss Augmentation for Improving Arabic Target Sense Verification
This work addresses the problem of limited semantic resources for Arabic natural language processing, but it is incremental: it builds on an existing dataset and yields only modest performance gains.
The paper tackles the lack of semantic datasets for Arabic by augmenting the ArabGlossBERT dataset with machine back-translation, increasing it from 167K to 352K context-gloss pairs, and reports accuracy between 78% and 84% on the target sense verification task.
The Arabic language lacks semantic datasets and sense inventories. The most common semantically labeled dataset for Arabic is ArabGlossBERT, a relatively small dataset of 167K context-gloss pairs (about 60K positive and 107K negative pairs) collected from Arabic dictionaries. This paper presents an enrichment of the ArabGlossBERT dataset, augmenting it using (Arabic-English-Arabic) machine back-translation. Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation by fine-tuning BERT on the target sense verification (TSV) task using different data configurations. Overall, the accuracy ranges between 78% and 84% across data configurations. Although our approach performed on par with the baseline, we did observe improvements for some POS tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering a larger vocabulary and more contexts. We provide an in-depth analysis of the accuracy for each part-of-speech (POS) tag.
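To make the augmentation step concrete, the following is a minimal sketch of back-translating context-gloss pairs through a pivot language. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say whether contexts, glosses, or both were back-translated (both are assumed here), and the names `Pair`, `back_translate`, `augment`, and `toy_translate` are hypothetical. The toy translator is a lookup table standing in for a real machine translation system, and the strings are placeholders rather than real Arabic text.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Pair:
    context: str  # sentence containing the target word
    gloss: str    # candidate dictionary definition
    label: int    # 1 if the gloss matches the sense in context, else 0

# Hypothetical translator interface: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]

def back_translate(text: str, translate: Translator,
                   source: str = "ar", pivot: str = "en") -> str:
    """Translate source -> pivot -> source to obtain a paraphrase."""
    return translate(translate(text, source, pivot), pivot, source)

def augment(pairs: List[Pair], translate: Translator) -> List[Pair]:
    """Return the original pairs plus any new back-translated pairs.

    Assumption: both context and gloss are back-translated and the label
    is carried over unchanged. Pairs identical to an existing one are skipped.
    """
    seen = set(pairs)
    result = list(pairs)
    for pair in pairs:
        candidate = Pair(
            back_translate(pair.context, translate),
            back_translate(pair.gloss, translate),
            pair.label,
        )
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result

def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a real MT system; unknown inputs are returned unchanged."""
    table = {
        ("ctx-ar-1", "ar", "en"): "ctx-en-1",
        ("ctx-en-1", "en", "ar"): "ctx-ar-1-paraphrase",
    }
    return table.get((text, src, tgt), text)

pairs = [
    Pair("ctx-ar-1", "gloss-ar-1", 1),
    Pair("ctx-ar-2", "gloss-ar-1", 0),
]
augmented = augment(pairs, toy_translate)
```

In this toy run only the first pair yields a new paraphrase, so the augmented list holds three pairs. Skipping duplicates is one plausible reason an augmented dataset ends up smaller than twice the original (352K rather than 334K would be exceeded only if more than one variant per pair were kept), though the abstract does not state how the final pair counts were reached.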