MIT-QCRI Arabic Dialect Identification System for the 2017 Multi-Genre Broadcast Challenge
This work addresses the need for dialect identification in Arabic speech processing for media annotation, though its contribution is incremental: it builds on existing methods and targets a specific challenge.
The researchers tackled the problem of identifying Arabic dialects in broadcast media by developing a system for the MGB-3 challenge, achieving 75% accuracy on a 10-hour test set.
In order to successfully annotate the Arabic speech content found in open-domain media broadcasts, it is essential to be able to process a diverse set of Arabic dialects. For the 2017 Multi-Genre Broadcast challenge (MGB-3) there were two possible tasks: Arabic speech recognition, and Arabic Dialect Identification (ADI). In this paper, we describe our efforts to create an ADI system for the MGB-3 challenge, with the goal of distinguishing amongst four major Arabic dialects, as well as Modern Standard Arabic. Our research focused on dialect variability and on domain mismatch between the training and test data. In order to achieve a robust ADI system, we explored both Siamese neural network models to learn similarities and dissimilarities among Arabic dialects, and i-vector post-processing to compensate for the domain mismatch. Both acoustic and linguistic features were used for the final MGB-3 submissions, with the best primary system achieving 75% accuracy on the official 10-hour test set.
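To make the idea of learning similarities and dissimilarities between dialects concrete, the following is a minimal sketch of a pairwise objective of the kind a Siamese model can be trained with. It is an illustration under stated assumptions, not the loss used in the submitted system: the function names `cosine_similarity` and `pair_loss`, the `margin` parameter, and the toy two-dimensional vectors are all hypothetical, and the inputs merely stand in for utterance-level embeddings such as i-vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(u, v, same_dialect, margin=0.0):
    """Illustrative cosine-based contrastive loss for one pair of embeddings.

    Assumption: this is a generic Siamese-style objective, not the
    formulation from the paper. Same-dialect pairs are pulled together
    (loss falls as similarity approaches 1); different-dialect pairs are
    penalized only while their similarity exceeds `margin`.
    """
    sim = cosine_similarity(u, v)
    if same_dialect:
        return 1.0 - sim
    return max(0.0, sim - margin)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for utterance embeddings (hypothetical data).
    utt_a = [1.0, 0.0]
    utt_b = [1.0, 0.0]
    utt_c = [0.0, 1.0]
    print(pair_loss(utt_a, utt_b, same_dialect=True))
    print(pair_loss(utt_a, utt_c, same_dialect=False))
    print(pair_loss(utt_a, utt_b, same_dialect=False))
```

In a real system both inputs would pass through the same shared-weight network before the similarity is computed, and the loss would be averaged over many sampled pairs; the sketch only shows the per-pair scoring.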