CL · Jan 24, 2025

WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

arXiv:2501.14506v1 · 1 citation · h-index: 30 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of high-quality training data for low-resource languages, which benefits multilingual model research, though the contribution is incremental because it builds on existing data collection methods.

The paper introduces WanJuanSiLu, an open-source webtext dataset for low-resource languages, developed using a systematic processing framework to enhance quality, security, and linguistic diversity, with data for five languages fully released.

This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset, while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0
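
The stages named in the abstract can be pictured as a chain of filters applied to each document in turn. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Doc` type, every function name, the blocklist, the keyword lists, and the word-count threshold are hypothetical placeholders, and each stage is reduced to a toy heuristic. Data extraction is assumed to have already happened, so the input is plain text.

```python
import hashlib
import re
from dataclasses import dataclass, field

# Hypothetical placeholders for illustration only; none of these
# values come from the WanJuanSiLu paper or repository.
BLOCKLIST = {"badword"}
THEME_KEYWORDS = {
    "sports": {"match", "team"},
    "finance": {"bank", "market"},
}


@dataclass
class Doc:
    text: str
    meta: dict = field(default_factory=dict)


def clean(doc):
    """Corpus cleaning: collapse whitespace runs and drop empty documents."""
    text = re.sub(r"\s+", " ", doc.text).strip()
    return Doc(text, doc.meta) if text else None


def make_deduplicator():
    """Content deduplication: exact match on a hash of the lowercased text."""
    seen = set()

    def dedup(doc):
        digest = hashlib.sha256(doc.text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            return None
        seen.add(digest)
        return doc

    return dedup


def security_filter(doc):
    """Security filtering: drop documents containing any blocklisted word."""
    words = set(re.findall(r"\w+", doc.text.lower()))
    return None if words & BLOCKLIST else doc


def quality_filter(doc, min_words=5):
    """Quality evaluation: a toy rule that drops very short documents."""
    return doc if len(doc.text.split()) >= min_words else None


def classify_theme(doc):
    """Theme classification: pick the theme with the most keyword hits."""
    words = set(re.findall(r"\w+", doc.text.lower()))
    best = max(THEME_KEYWORDS, key=lambda t: len(words & THEME_KEYWORDS[t]))
    hits = len(words & THEME_KEYWORDS[best])
    doc.meta["theme"] = best if hits > 0 else "other"
    return doc


def run_pipeline(docs, stages):
    """Apply each stage in order; a stage returning None drops the document."""
    kept = []
    for doc in docs:
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                break
        else:
            kept.append(doc)
    return kept
```

A real pipeline for low-resource languages would likely replace each heuristic with something stronger (for example near-duplicate detection or model-based quality scoring), but the stage-by-stage structure would stay the same: each stage either passes a document on or drops it.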

Code Implementations: 1 repo
