HArnESS: Lightweight Distilled Arabic Speech Foundation Models

Vrunda N. Sukhadia, Shammur Absar Chowdhury

arXiv:2604.14186 · 53.6 · h-index: 12

Predicted impact: top 66% in AS · last 90 days · Originality: Incremental advance

AI Analysis

Provides practical, lightweight Arabic speech foundation models for resource-constrained deployment, addressing the gap in Arabic-centric SSL models.

HArnESS introduces a family of Arabic-centric self-supervised speech models trained with iterative self-distillation, achieving strong accuracy-efficiency trade-offs on ASR, DID, and SER. Compared with HuBERT and XLS-R, HArnESS improves performance on Arabic tasks, and the compressed student models remain competitive despite substantial size reduction.

Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervised speech model family trained from scratch with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on Automatic Speech Recognition (ASR), Dialect Identification (DID), and Speech Emotion Recognition (SER). Our approach begins with a large bilingual Arabic-English teacher and progressively distills its knowledge into compressed student models while preserving Arabic-relevant acoustic and paralinguistic representations. We further study PCA-based compression of the teacher supervision signal to better match the capacity of shallow and thin students. Compared with HuBERT and XLS-R, HArnESS consistently improves performance on Arabic downstream tasks, while the compressed models remain competitive under substantial structural reduction. These results position HArnESS as a practical and accessible Arabic-centric SSL foundation for real-world speech applications.
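
The abstract mentions PCA-based compression of the teacher supervision signal but does not give details here. As a rough illustration only, the sketch below projects frame-level teacher features onto their top principal components so that a thinner student could be given lower-dimensional targets. The feature dimensions, the random stand-in data, and the helper names `fit_pca` and `compress_targets` are all hypothetical and are not taken from the paper; how the compressed signal is actually used in HArnESS training may differ.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on (num_frames, dim) features.

    Returns the feature mean and a (dim, n_components) projection matrix.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data: rows of vt are principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components].T


def compress_targets(features, mean, components):
    """Project features onto the retained principal directions."""
    return (features - mean) @ components


# Hypothetical stand-in for teacher features: 500 frames, 32 dimensions.
rng = np.random.default_rng(0)
teacher_feats = rng.standard_normal((500, 32)) @ rng.standard_normal((32, 32))

# Assumption: the student is supervised with 8-dimensional targets.
mean, components = fit_pca(teacher_feats, n_components=8)
targets = compress_targets(teacher_feats, mean, components)
print(targets.shape)
```

In a setup like this, the projection would be fitted once on a sample of teacher outputs and then reused for all training frames, so that every student sees targets in the same reduced space.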
