CL | Jan 24, 2025

WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang

arXiv:2501.14506v1 | 2.71 citations | h-index: 30 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of high-quality training data for low-resource languages, which benefits multilingual model research, though the contribution is incremental because it builds on existing data collection methods.

The paper introduces WanJuanSiLu, an open-source webtext dataset for low-resource languages, developed using a systematic processing framework to enhance quality, security, and linguistic diversity, with data for five languages fully released.

This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0
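
The stages named in the abstract (extraction, cleaning, deduplication, security filtering, quality evaluation, theme classification) can be pictured as a chain of filters over raw web pages. The Python sketch below is only an illustration of that shape, not the authors' implementation: every function name, the tag-stripping extractor, the exact-hash deduplication, the word blocklist, the length-based quality check, and the keyword classifier are hypothetical stand-ins chosen for brevity. The actual framework presumably uses far more sophisticated components at each stage.

```python
import hashlib
import html
import re
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    theme: str = "other"


def extract(raw_html: str) -> str:
    """Data extraction: strip tags and unescape entities (toy stand-in)."""
    return html.unescape(re.sub(r"<[^>]+>", " ", raw_html))


def clean(text: str) -> str:
    """Corpus cleaning: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def is_safe(text: str, blocklist: frozenset) -> bool:
    """Security filtering: reject text containing any blocklisted token."""
    return not (set(text.lower().split()) & blocklist)


def quality_ok(text: str, min_words: int = 5) -> bool:
    """Quality evaluation: a crude length heuristic, not a learned scorer."""
    return len(text.split()) >= min_words


def classify(text: str, keywords: dict) -> str:
    """Theme classification: first theme with a keyword found in the text."""
    lowered = text.lower()
    for theme, words in keywords.items():
        if any(w in lowered for w in words):
            return theme
    return "other"


def run_pipeline(pages, blocklist, keywords):
    """Run all stages in order; exact duplicates are dropped by SHA-256 hash."""
    seen = set()
    kept = []
    for raw in pages:
        text = clean(extract(raw))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # content deduplication (exact match only)
        seen.add(digest)
        if not is_safe(text, blocklist) or not quality_ok(text):
            continue
        kept.append(Doc(text=text, theme=classify(text, keywords)))
    return kept


if __name__ == "__main__":
    pages = [
        "<p>The national team won the football match last night.</p>",
        "<p>The national team won   the football match last night.</p>",
        "<p>too short</p>",
    ]
    for doc in run_pipeline(pages, frozenset({"badword"}), {"sports": ["football"]}):
        print(doc.theme, "|", doc.text)
```

Ordering the cheap steps (cleaning, hashing) before the filters keeps the sketch simple. A real pipeline for low-resource languages would also need language identification and near-duplicate detection, which are omitted here.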
