naab: A ready-to-use plug-and-play corpus for Farsi
The corpus is a valuable resource for NLP researchers and practitioners working on low-resource languages, though the contribution is incremental, since it applies existing methods to new data.
The authors tackled the performance gap of large language models in low-resource languages like Farsi by introducing naab, the largest publicly available, cleaned Farsi textual corpus, consisting of 130GB of data with over 250 million paragraphs and 15 billion words.
The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low- and mid-resource languages, such as Farsi, still lags behind that in resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word NAAB (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource languages, to improve the performance of LLMs in their respective domains and bridge the gap between resource-rich and resource-poor languages.
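To give a sense of what cleaning a custom Farsi corpus might involve, here is a minimal sketch of a paragraph-level cleaning pass. It is not the authors' pre-processing toolkit. The function names (`clean_paragraph`, `clean_corpus`), the `min_words` threshold, and the choice of steps (mapping Arabic yeh and kaf to their Persian forms, collapsing whitespace, dropping very short and exactly duplicated paragraphs) are all assumptions made for illustration.

```python
import re

# Assumption: these two Arabic code points are treated as variants of the
# Persian letters and mapped to them. A real toolkit may handle many more.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})


def clean_paragraph(text: str) -> str:
    """Normalize characters and collapse runs of whitespace (hypothetical helper)."""
    text = text.translate(ARABIC_TO_PERSIAN)
    # \s matches Unicode whitespace but not the zero-width non-joiner (U+200C),
    # so ZWNJ characters inside Farsi words are left untouched.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_corpus(paragraphs, min_words=3):
    """Yield cleaned paragraphs, skipping short ones and exact duplicates.

    min_words is an illustrative threshold, not a value from the paper.
    """
    seen = set()
    for raw in paragraphs:
        cleaned = clean_paragraph(raw)
        if len(cleaned.split()) < min_words:
            continue
        if cleaned in seen:
            continue
        seen.add(cleaned)
        yield cleaned
```

Because `clean_corpus` is a generator, it can be applied to a stream of paragraphs without loading them all at once, although the `seen` set still grows with the number of unique paragraphs. At the scale of a corpus like naab, a hash-based or approximate deduplication scheme would likely be needed instead.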