CL | AI | Jun 6, 2023

Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text

Parnia Bahar, Mattia Di Gangi, Nick Rossenbach, Mohammad Zeineldeen

arXiv:2306.03557v2 | 1.33 citations | h-index: 9 | Has Code

Originality: Incremental advance

AI Analysis

This work improves diacritization accuracy for applications like language learning and speech synthesis, though the advance is incremental: it extends existing methods with optional human hints.

The paper tackles the problem of automatic Arabic diacritization by proposing a model that uses partially-diacritized text as input, achieving state-of-the-art results with over 60% fewer parameters on common benchmarks.

Automatic Arabic diacritization is useful in many applications, ranging from reading support for language learners to accurate pronunciation prediction for downstream tasks like speech synthesis. While most previous work has focused on models that operate on raw non-diacritized text, production systems can gain accuracy by first letting humans partly annotate ambiguous words. In this paper, we propose 2SDiac, a multi-source model that can effectively support optional diacritics in the input to inform all predictions. We also introduce Guided Learning, a training scheme that leverages diacritics given in the input under different levels of random masking. We show that hints provided at test time affect more output positions than those annotated. Moreover, experiments on two common benchmarks show that our approach i) greatly outperforms the baseline even when evaluated on non-diacritized text; and ii) achieves state-of-the-art results while reducing the parameter count by over 60%.
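The abstract describes training on inputs whose diacritics are randomly masked at different levels. The following is a minimal sketch of how such partially-diacritized inputs could be generated; it is not the authors' implementation. The function names (`split_diacritics`, `mask_hints`, `sample_training_input`) and the set of masking levels are hypothetical, and the diacritic set is assumed to be the Unicode range U+064B to U+0652.

```python
import random

# Arabic diacritic code points U+064B..U+0652 (tanween, short vowels, shadda, sukun).
# Assumption: the paper's exact diacritic inventory may differ.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return (base, marks), where marks[i] holds the diacritics attached to base[i]."""
    base, marks = [], []
    for ch in text:
        if ch in DIACRITICS and base:
            marks[-1] += ch  # attach the mark to the preceding base character
        else:
            base.append(ch)
            marks.append("")
    return base, marks


def mask_hints(text, keep_prob, rng):
    """Keep each character's diacritics with probability keep_prob, else drop them."""
    base, marks = split_diacritics(text)
    kept = [m if m and rng.random() < keep_prob else "" for m in marks]
    return "".join(b + m for b, m in zip(base, kept))


def sample_training_input(text, rng, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick a masking level at random (hypothetical levels), then mask the input.

    The training target would remain the fully diacritized text.
    """
    return mask_hints(text, rng.choice(levels), rng)
```

With `keep_prob=0.0` the model sees raw non-diacritized text, and with `keep_prob=1.0` it sees every hint, so sampling a level per example exposes the model to the whole range in between.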

