WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

Zien Sheikh Ali, Hamdy Mubarak, Soon-Gyo Jung, Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury

arXiv:2605.16364

AI Analysis

For researchers building Arabic voice assistants, this dataset and evaluation methodology address the lack of in-the-wild spoken interaction data with explicit feedback and answerability annotations.

The paper introduces WASIL, a dataset of 8,529 Arabic spoken interaction turns with ASR hypotheses, assistant responses, and like/dislike feedback, plus a 2,000-turn test set covering MSA and four dialects. It provides gold transcripts via multi-ASR agreement-guided post-editing and annotates answerability to isolate ASR effects, enabling reference-free evaluation of LLM responses.
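One way to picture agreement-guided post-editing is as a triage step: where several ASR systems produce nearly the same transcript, a light check may be enough, and where they diverge an annotator edits the turn. The sketch below is only an illustration under that assumption, not the paper's procedure. The function names, the token-level edit-distance similarity, and the 0.9 threshold are all hypothetical choices.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def agreement(hypotheses):
    """Mean pairwise similarity (1 - normalised edit distance) over ASR outputs."""
    token_lists = [h.split() for h in hypotheses]
    sims = []
    for a, b in combinations(token_lists, 2):
        longest = max(len(a), len(b))
        if longest == 0:
            sims.append(1.0)  # two empty hypotheses agree trivially
        else:
            sims.append(1.0 - word_edit_distance(a, b) / longest)
    # With fewer than two hypotheses there is nothing to disagree about.
    return sum(sims) / len(sims) if sims else 1.0


def needs_post_editing(hypotheses, threshold=0.9):
    """Flag a turn for human post-editing when ASR systems disagree.

    The threshold is an assumed value, chosen only for illustration.
    """
    return agreement(hypotheses) < threshold
```

For example, `needs_post_editing(["a b c", "a b d"])` flags the turn, because the two hypotheses differ in one of three tokens, while three identical hypotheses would pass without review.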

Large Language Model (LLM) voice assistants are commonly built as cascaded systems in which Automatic Speech Recognition (ASR) feeds an LLM, so recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, which makes it hard to isolate ASR effects. We release WASIL (the name denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects, with dialect labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses generated from ASR vs. gold transcripts using multi-judge LLM scoring.
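The comparison of responses from ASR vs. gold transcripts can be sketched as a per-label score gap. The code below is a minimal illustration, not the paper's evaluation code: it assumes each turn carries a list of judge scores for the response to the ASR hypothesis and another for the response to the gold transcript, averages the judges, and reports the mean gap per answerability label. The field names (`answerability`, `asr_scores`, `gold_scores`), the function name, and the numeric score scale are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def asr_degradation(turns):
    """Mean (gold - ASR) judge-score gap, grouped by answerability label.

    Each turn is a dict with hypothetical keys:
      "answerability": label string, e.g. "answerable"
      "asr_scores":    scores from several judges for the ASR-based response
      "gold_scores":   scores from the same judges for the gold-based response
    """
    gaps = defaultdict(list)
    for turn in turns:
        gold = mean(turn["gold_scores"])  # average over judges
        asr = mean(turn["asr_scores"])
        gaps[turn["answerability"]].append(gold - asr)
    return {label: mean(values) for label, values in gaps.items()}


# Toy data, invented purely for illustration.
example_turns = [
    {"answerability": "answerable", "asr_scores": [2, 3], "gold_scores": [4, 5]},
    {"answerability": "answerable", "asr_scores": [4, 4], "gold_scores": [4, 4]},
    {"answerability": "unsupported", "asr_scores": [1, 1], "gold_scores": [1, 1]},
]
example_gaps = asr_degradation(example_turns)
```

Under this framing, a positive gap on answerable turns points to ASR-induced degradation, while a gap near zero on unsupported or not-a-request turns suggests the low scores are intrinsic to the turn and not caused by recognition errors.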
