CL · Apr 13

KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset

arXiv:2604.11066
AI Analysis

Provides a much-needed, clean, large-scale pretraining resource for the low-resource Kashmiri language, enabling future language model development.

The authors created KS-PRET-5M, the largest public Kashmiri pretraining dataset (5.09M words, 12.13M tokens), achieving 0.9965 script purity via an 11-stage cleaning pipeline. The dataset is released under CC BY 4.0 for NLP research.

We present KS-PRET-5M, the largest publicly available pretraining dataset for the Kashmiri language, comprising 5,090,244 (5.09M) words, 27,692,959 (27.7M) characters, and a vocabulary of 295,433 (295.4K) unique word types. We assembled the dataset from two source classes. The first is digitized archival and literary material (literature, news, biographies, novels, poetry, religious scholarship, and academic writing) recovered from the proprietary InPage desktop-publishing format using the converter of Malik \cite{malik2024inpage}. The second is Unicode-native text collected from Kashmiri-language web sources. All text was processed through an eleven-stage cleaning pipeline that achieves a mean Kashmiri script ratio of 0.9965 and reduces Devanagari contamination to 146 characters across the full dataset. We tokenized the dataset empirically using google/muril-base-cased, yielding a subword ratio of 2.383 tokens per word and a total of approximately 12.13 million subword tokens, substantially higher than prior estimates derived from non-Kashmiri Perso-Arabic analogues. KS-PRET-5M is released as a single continuous text stream under CC BY 4.0 to support language model pretraining, tokenizer training, and computational linguistic research for Kashmiri.
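
The corpus statistics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The abstract does not define "script ratio", so the sketch assumes it is the share of alphabetic characters that fall in the Unicode Arabic-script blocks, and it computes one corpus-level ratio rather than a mean over lines or documents. The function names `corpus_stats` and `estimated_tokens` and the chosen block ranges are hypothetical.

```python
# Assumed Unicode ranges for Perso-Arabic script (Arabic, Arabic Supplement,
# Arabic Extended-A, Presentation Forms A and B). The paper's exact definition
# of "Kashmiri script" is not stated in the abstract.
ARABIC_RANGES = [
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0x08A0, 0x08FF),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFC),
]
DEVANAGARI_RANGES = [(0x0900, 0x097F)]


def in_ranges(ch, ranges):
    """Return True if the code point of ch lies in any (lo, hi) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def corpus_stats(text):
    """Count words, characters, word types, and script composition."""
    words = text.split()  # whitespace tokenization (an assumption)
    letters = [ch for ch in text if ch.isalpha()]
    arabic = sum(1 for ch in letters if in_ranges(ch, ARABIC_RANGES))
    devanagari = sum(1 for ch in text if in_ranges(ch, DEVANAGARI_RANGES))
    return {
        "words": len(words),
        "characters": len(text),
        "word_types": len(set(words)),
        # Corpus-level ratio: Arabic-script letters over all letters.
        "script_ratio": arabic / len(letters) if letters else 0.0,
        "devanagari_chars": devanagari,
    }


def estimated_tokens(word_count, tokens_per_word):
    """Total subword tokens implied by a word count and a tokens-per-word ratio."""
    return round(word_count * tokens_per_word)


# Figures from the abstract: 5,090,244 words at 2.383 tokens per word.
print(estimated_tokens(5_090_244, 2.383))  # 12130051, i.e. about 12.13M
```

The last line reproduces the abstract's arithmetic: 5,090,244 words times 2.383 tokens per word gives roughly 12.13 million subword tokens. The tokens-per-word ratio itself would have to be measured by running the actual tokenizer over the text, which this sketch does not do.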

