CL, Sep 2, 2017

Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

arXiv:1709.00616v1, 1093 citations
Originality: Incremental advance
AI Analysis

This work aims to improve Arabic NLP applications by replacing complicated segmentation tools with simpler, language-independent alternatives. The contribution is incremental, since it builds on existing methods.

The paper addresses Arabic word segmentation by exploring three language-independent alternatives to morphological segmentation. On machine translation and POS tagging, these alternatives come close to, and occasionally surpass, state-of-the-art performance. The neural machine translation system performs best when the source-to-target token ratio is close to 1 or greater.

Word segmentation plays a pivotal role in improving any Arabic NLP application, and considerable research effort has therefore gone into improving its accuracy. Off-the-shelf tools, however, are i) complicated to use and ii) domain- and dialect-dependent. We explore three language-independent alternatives to morphological segmentation: i) data-driven sub-word units, ii) characters as the unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source to target tokens, and that a ratio close to 1 or greater gives optimal performance.
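To make the token-ratio observation concrete, the sketch below compares the source-to-target token ratio of one sentence pair under word-level, character-level, and sub-word segmentation. It is a minimal illustration, not the paper's pipeline. The sub-word units come from a toy byte-pair-encoding (BPE) learner, which is one common way to obtain data-driven sub-word units and is assumed here. The Arabic word is written in a Buckwalter-style transliteration, and the tiny training list, the `</w>` end-of-word marker, and all function names are hypothetical.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (illustrative convention, an assumption)


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy BPE)."""
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_segment(word, merges):
    """Segment one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def char_segment(sentence):
    """One token per character, dropping whitespace (a simplification)."""
    return [c for c in sentence if not c.isspace()]


def token_ratio(src_tokens, tgt_tokens):
    """Number of source tokens divided by number of target tokens."""
    return len(src_tokens) / len(tgt_tokens)


# One transliterated Arabic word that corresponds to five English words.
source = "wsyktbwnhA"
target = "and they will write it".split()

word_ratio = token_ratio(source.split(), target)         # 1 source token / 5 target tokens
char_ratio = token_ratio(char_segment(source), target)   # 10 source tokens / 5 target tokens

# Made-up training words, only to give the toy learner something to merge.
merges = learn_bpe(["wsyktbwn", "wsyktbwnhA", "syktb", "yktbwn"], num_merges=4)
bpe_tokens = bpe_segment(source, merges)
bpe_ratio = token_ratio(bpe_tokens, target)

print(word_ratio, char_ratio, bpe_ratio)
```

With unsegmented words, the single source token faces five target tokens, so the ratio is well below 1. Character-level segmentation pushes it above 1, and the sub-word ratio falls somewhere between the two depending on how many merges are learned.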

