AraS2P: Arabic Speech-to-Phonemes System
This work addresses phoneme-level mispronunciation detection in Arabic. The contribution is incremental: it builds on existing models with targeted adaptations.
The paper tackles Arabic speech-to-phonemes conversion for mispronunciation detection by adapting Wav2Vec2-BERT with a two-stage training strategy, and the resulting system placed first on the Iqra'Eval 2025 leaderboard.
This paper describes AraS2P, our speech-to-phonemes system submitted to the Iqra'Eval 2025 Shared Task. We adapted Wav2Vec2-BERT using a two-stage training strategy. In the first stage, we performed task-adaptive continued pretraining on large-scale Arabic speech-phoneme datasets, whose phoneme labels were generated by converting the Arabic text with the MSA Phonetiser. In the second stage, we fine-tuned the model on the official shared task data, augmented with recitations synthesized by XTTS-v2 that vary the Ayat segments, the speaker embeddings, and the text, which was perturbed to simulate possible human errors. The system ranked first on the official leaderboard, demonstrating that phoneme-aware pretraining combined with targeted augmentation yields strong performance in phoneme-level mispronunciation detection.
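The perturbation step mentioned above can be illustrated with a minimal sketch. It is not the system's actual procedure: the abstract does not specify the error types, their rates, or the unit being perturbed. The sketch assumes a sequence of phoneme-like tokens, three error types (substitution, deletion, insertion), and a toy inventory; the names `perturb_phonemes` and `TOY_INVENTORY` are hypothetical.

```python
import random

# Hypothetical toy inventory (assumption); a real system would take the
# symbol set from the phonetiser it uses.
TOY_INVENTORY = ["b", "t", "s", "S", "d", "D", "a", "i", "u", "aa", "ii", "uu"]


def perturb_phonemes(phonemes, rate=0.1, seed=None, inventory=TOY_INVENTORY):
    """Return a copy of `phonemes` with simulated pronunciation errors.

    Each token is perturbed independently with probability `rate`.
    A perturbed token is substituted, deleted, or followed by an
    inserted token, with the error type chosen uniformly (an assumption).
    """
    rng = random.Random(seed)  # local generator keeps results reproducible
    out = []
    for p in phonemes:
        if rng.random() >= rate:
            out.append(p)  # keep the token unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            # Replace with a different symbol from the inventory.
            out.append(rng.choice([q for q in inventory if q != p]))
        elif op == "insert":
            # Keep the token and add a spurious one after it.
            out.append(p)
            out.append(rng.choice(inventory))
        # "delete": append nothing, so the token is dropped.
    return out


if __name__ == "__main__":
    reference = ["b", "i", "s", "m", "i"]
    print(perturb_phonemes(reference, rate=0.3, seed=0))
```

In a pipeline like the one described, the perturbed sequence would serve as the target transcript for the corresponding synthesized audio, so the model sees examples in which the spoken content deviates from the canonical text.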