FineWeb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models
The dataset is a useful resource for training Arabic small language models, but the contribution is incremental: it applies an existing machine translation method to new data.
The authors address the scarcity of high-quality Arabic data for multilingual large language models by creating FineWeb-Edu-Ar, a machine translation of the English FineWeb-Edu dataset. At 202B tokens, it is, to the authors' knowledge, the largest publicly available machine-translated Arabic dataset.
As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality and readily available data online has led to a multitude of synthetic dataset generation approaches. A key technique in this space is machine translation (MT), where high-quality English text is translated into a comparatively low-resource target language. This report introduces FineWeb-Edu-Ar, a machine-translated version of the exceedingly popular (deduplicated) FineWeb-Edu dataset from HuggingFace. To the best of our knowledge, FineWeb-Edu-Ar is the largest publicly available machine-translated Arabic dataset, with a size of 202B tokens as counted by an Arabic-trained tokenizer.