CL | DB | Nov 25, 2024

FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

arXiv:2411.16387v1 | 1 citation | h-index: 18 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the limited pretraining resources available for Traditional Chinese users, though its contribution is incremental because it builds on existing English curation efforts.

The paper tackles the lack of high-quality pretraining datasets for Traditional Chinese by introducing FineWeb-zhtw, a dataset curated from web text. The code and data are publicly available to support LLM development.

The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts to curate such datasets for English users, comparable initiatives for Traditional Chinese are relatively scarce. Building on the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We designed multiple stages of filters to account for the linguistic differences between English and Traditional Chinese and to ensure comprehensiveness and quality. We assessed effectiveness by querying dataset samples against three main objectives. Our code and datasets are publicly available.
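The abstract does not list the individual filter stages, so the following is only a minimal sketch of what a staged filtering pass over web documents might look like. Every stage, name, and threshold here (`min_length`, `min_cjk_ratio`, `max_duplicate_line_ratio`, `run_pipeline`) is a hypothetical illustration and not the authors' pipeline. The character-ratio check also does not distinguish Traditional from Simplified Chinese, which a real pipeline would need to handle.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical stage type: takes a document and returns True to keep it.
Filter = Callable[[str], bool]

# Basic CJK Unified Ideographs block; an illustrative approximation only.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def min_length(n: int) -> Filter:
    """Keep documents with at least n non-whitespace characters."""
    return lambda doc: len("".join(doc.split())) >= n


def min_cjk_ratio(threshold: float) -> Filter:
    """Keep documents whose share of CJK ideographs meets the threshold."""
    def keep(doc: str) -> bool:
        chars = "".join(doc.split())
        if not chars:
            return False
        return len(CJK_CHAR.findall(chars)) / len(chars) >= threshold
    return keep


def max_duplicate_line_ratio(threshold: float) -> Filter:
    """Drop documents in which too many non-empty lines are repeats."""
    def keep(doc: str) -> bool:
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if not lines:
            return False
        duplicate_ratio = 1 - len(set(lines)) / len(lines)
        return duplicate_ratio <= threshold
    return keep


def run_pipeline(docs: Iterable[str], stages: list[Filter]) -> Iterator[str]:
    """Yield only the documents that pass every stage, in order."""
    for doc in docs:
        if all(stage(doc) for stage in stages):
            yield doc


if __name__ == "__main__":
    # Assumed thresholds, chosen only for the demonstration.
    stages = [min_length(10), min_cjk_ratio(0.5), max_duplicate_line_ratio(0.3)]
    samples = [
        "這是一段用於示範的繁體中文文字，內容足夠長。",
        "short",
        "This is an English-only sentence that is long enough.",
    ]
    for kept in run_pipeline(samples, stages):
        print(kept)
```

Ordering the stages from cheapest to most expensive lets `all` short-circuit early, so costlier checks only run on documents that survive the simpler ones.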

