Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers
This work addresses the problem of learning Quranic recitation for non-Arabic speakers by creating a crowdsourced audio dataset. The contribution is incremental, since it builds on existing crowdsourcing and annotation methods.
The paper tackles the challenge of learning Quranic recitation for non-Arabic speakers by crowdsourcing and labeling an audio dataset. It collected around 7000 recitations from 1287 participants across more than 11 non-Arabic countries and annotated 1166 of them in six categories, reaching a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between annotators, and an agreement of 0.89 between the labels assigned by the algorithm and the expert judgments.
This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we use the volunteer-based crowdsourcing genre and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations, and we developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries, and we have annotated 1166 recitations from the dataset in six categories. We have achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between the annotators, and an agreement of 0.89 between the labels assigned by the algorithm and the expert judgments.
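The three reported figures (crowd accuracy, inter-rater agreement, and algorithm-expert agreement) can be illustrated with a small sketch. The abstract does not say which aggregation rule or agreement statistic was used, so the majority vote and Cohen's kappa below are assumptions chosen for illustration. The labels c1 to c3 and the toy votes are hypothetical placeholders, not data or category names from the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label among one recitation's crowd votes."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, reference):
    """Fraction of items whose predicted label equals the reference label."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: three crowd votes per recitation, one expert label each.
crowd_votes = [
    ["c1", "c1", "c2"],
    ["c2", "c2", "c2"],
    ["c3", "c1", "c3"],
    ["c1", "c2", "c2"],
]
expert_labels = ["c1", "c2", "c3", "c1"]

# Assumed aggregation rule: simple majority vote per recitation.
aggregated = [majority_label(v) for v in crowd_votes]

print("aggregated labels:", aggregated)
print("accuracy vs. expert:", accuracy(aggregated, expert_labels))
print("kappa vs. expert:", cohen_kappa(aggregated, expert_labels))
```

Plain accuracy counts raw matches, while a chance-corrected statistic such as kappa discounts the matches that label frequencies alone would produce. That is one reason an agreement score can sit below an accuracy score computed on the same annotations.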