Kurdish (Sorani) Speech to Text: Presenting an Experimental Dataset
This work addresses the shortage of speech recognition resources for Sorani Kurdish speakers, particularly in educational contexts. The contribution is incremental, since it applies an existing method to new data.
The authors address the lack of automatic speech recognition for Sorani Kurdish by creating an experimental dataset (BD-4SK-ASR) and using CMUSphinx to build a system that recognizes simple sentences drawn from primary school vocabulary. No performance results or concrete figures are reported.
We present an experimental dataset, the Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR), which we used in the first attempt at developing an automatic speech recognition system for Sorani Kurdish. The objective of the project was to develop a system that could automatically recognize simple sentences built from the vocabulary used in grades one to three of primary schools in the Kurdistan Region of Iraq. We used CMUSphinx as our experimental environment and developed the dataset to train the system. The dataset is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license.
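As an illustration of what preparing training data for CMUSphinx can involve, the following minimal sketch builds the two control files that the CMUSphinx acoustic model training tutorial describes: a file ID list and a transcription list. It does not reflect the actual layout of BD-4SK-ASR. The function name `build_sphinx_lists`, the speaker and utterance IDs, and the placeholder words are all hypothetical.

```python
def build_sphinx_lists(utterances):
    """Build the text of a .fileids and a .transcription file.

    `utterances` is a list of (file_id, transcript) pairs, where file_id is
    the audio path relative to the wav directory, without extension.
    This input structure is an assumption made for illustration.
    """
    fileids_lines = []
    transcription_lines = []
    for file_id, transcript in utterances:
        # Collapse repeated whitespace so each word is separated by one space.
        words = " ".join(transcript.split())
        fileids_lines.append(file_id)
        # The ID in parentheses is the last path component of the file ID.
        utt_id = file_id.rsplit("/", 1)[-1]
        transcription_lines.append(f"<s> {words} </s> ({utt_id})")
    fileids_text = "\n".join(fileids_lines) + "\n"
    transcription_text = "\n".join(transcription_lines) + "\n"
    return fileids_text, transcription_text


# Hypothetical sample data; real entries would hold Sorani Kurdish sentences.
sample = [
    ("speaker_1/utt_001", "word_a  word_b"),
    ("speaker_1/utt_002", "word_c"),
]
fileids_text, transcription_text = build_sphinx_lists(sample)
print(fileids_text, end="")
print(transcription_text, end="")
```

Both lists must cover the same utterances in the same order, which is why the sketch derives them from a single list of pairs.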