CL | Jan 15, 2024

On the importance of Data Scale in Pretraining Arabic Language Models

arXiv:2401.07760v1 | h-index: 16 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks in Arabic NLP by emphasizing data scale, though it is incremental as it builds on existing models.

The study demonstrates that retraining state-of-the-art Arabic language models on massive-scale, high-quality corpora significantly improves performance, achieving top results on ALUE and ORCA leaderboards and showing that pretraining data is the primary factor for gains.

Pretraining monolingual language models has proven vital for performance in Arabic Natural Language Processing (NLP) tasks. In this paper, we conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs). More precisely, we reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora. We significantly improve the performance of the leading Arabic encoder-only BERT-base and encoder-decoder T5-base models on the ALUE and ORCA leaderboards, thereby reporting state-of-the-art results in their respective model categories. In addition, our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors. Our models and source code are publicly available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/JABER-PyTorch.

Code Implementations: 1 repo