CL SD, Feb 15

From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Jandad Jahani, Mursal Dawodi, Jawid Ahmad Baktash

arXiv:2602.14062v1

Originality: Synthesis-oriented

AI Analysis

The paper addresses the scarcity of openly licensed speech data for Pashto, a language spoken by more than 60 million people, by providing a quantitative audit of the Pashto Common Voice corpus to guide improvements in dataset maturity for ASR development.

The paper analyzes the Pashto Common Voice dataset, documenting its growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, of which 975.89 hours are validated and usable for supervised ASR training. It identifies issues such as highly concentrated contributor participation (Gini = 0.941) and incomplete metadata (41.97% of clips lack self-reported gender labels).

Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
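
The two concentration statistics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not say how the figures were computed, so the sketch uses the standard rank-weighted Gini formula and a simple cumulative-coverage count. The function names `gini` and `share_of_items_covering` are hypothetical, and the inputs are assumed to be per-contributor clip counts and one sentence string per validated clip.

```python
from collections import Counter
from typing import Iterable, Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts.

    0 means every contributor recorded the same number of clips;
    values near 1 mean a few contributors recorded almost everything.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # with ranks i starting at 1 over the ascending-sorted values.
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def share_of_items_covering(labels: Iterable[str], target: float = 0.5) -> float:
    """Smallest fraction of unique labels whose clips reach `target` of all clips.

    With one sentence string per clip as input, this answers
    "what share of unique sentences accounts for half of the clips?".
    """
    freq = sorted(Counter(labels).values(), reverse=True)
    total = sum(freq)
    if total == 0:
        return 0.0
    covered = 0
    for k, count in enumerate(freq, start=1):
        covered += count
        if covered >= target * total:
            return k / len(freq)
    return 1.0
```

Applied to a release, `gini` would take the number of clips per contributor and `share_of_items_covering` would take the sentence text of each validated clip. Which metadata columns supply those values depends on the release files and is left as an assumption here.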
