CL · Sep 2, 2017

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Ahmed Abdelali, Yonatan Belinkov, Stephan Vogel

arXiv:1709.00616v139.21093 citations

Originality: Incremental advance

AI Analysis

This work aims to improve Arabic NLP applications by replacing complicated segmentation tools with simpler, language-independent alternatives. The advance is incremental, since it builds on existing methods.

The paper addresses Arabic word segmentation by exploring three language-independent alternatives to morphological segmentation. On machine translation and POS tagging, these methods come close to, and occasionally surpass, state-of-the-art performance. The neural machine translation system performs best when the source-to-target token ratio is close to or greater than 1.

Word segmentation plays a pivotal role in improving any Arabic NLP application. Therefore, much research effort has gone into improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source and target tokens, and that a ratio close to 1 or greater gives optimal performance.
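To make the token-ratio observation concrete, the following is a minimal sketch of how the source-to-target token ratio shifts with the choice of segmentation. It is not the authors' code. The helper names (`word_tokens`, `char_tokens`, `token_ratio`), the transliterated toy word, and its hand-written morphological split are all assumptions made for illustration, and no segmentation tool is involved.

```python
def word_tokens(sentence):
    """Split on whitespace: one token per surface word (or per pre-split segment)."""
    return sentence.split()


def char_tokens(sentence, boundary="_"):
    """Split into characters, using a boundary symbol in place of spaces."""
    normalized = " ".join(sentence.split())
    return [boundary if ch == " " else ch for ch in normalized]


def token_ratio(source_tokens, target_tokens):
    """Number of source tokens divided by number of target tokens."""
    if not target_tokens:
        raise ValueError("target side must contain at least one token")
    return len(source_tokens) / len(target_tokens)


# Toy pair (assumption): a single transliterated Arabic word and an English gloss.
source_unsegmented = "wsyktbwnhA"
# Hand-written split for illustration only, not the output of any tool.
source_morph_segmented = "w+ s+ yktbwn +hA"
target = "and they will write it"

target_words = word_tokens(target)

ratios = {
    "unsegmented words": token_ratio(word_tokens(source_unsegmented), target_words),
    "morphological segments": token_ratio(word_tokens(source_morph_segmented), target_words),
    "characters": token_ratio(char_tokens(source_unsegmented), target_words),
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

In this toy pair the unsegmented source yields a ratio well below 1, while the segmented and character-level views move it toward or above 1, which is the regime the abstract reports as giving the best translation performance.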

