Large-Scale Machine Translation between Arabic and Hebrew: Available Corpora and Initial Results
This work addresses the lack of parallel corpora for Arabic-Hebrew translation, a language pair that matters for political and cultural reasons. The contribution is incremental, however, since it applies existing methods to this specific language pair.
The authors compare phrase-based and neural systems for machine translation between Arabic and Hebrew. They show that both tokenization and sub-word modeling improve translation performance, with a small advantage for the neural models.
Machine translation between Arabic and Hebrew has so far been limited by a lack of parallel corpora, despite the political and cultural importance of this language pair. Previous work relied on manually crafted grammars or pivoting via English, both of which are unsatisfactory for building a scalable and accurate MT system. In this work, we compare standard phrase-based and neural systems on Arabic-Hebrew translation. We experiment with tokenization by external tools and sub-word modeling by character-level neural models, and show that both methods lead to improved translation performance, with a small advantage for the neural models.
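To make the contrast between word-level and character-level input concrete, the following minimal Python sketch segments a sentence both ways. It is an illustration only and not the preprocessing used in this work: the function names and the `BOUNDARY` marker are hypothetical, and plain whitespace splitting stands in for the external tokenization tools.

```python
from typing import List

# Hypothetical word-boundary marker (an assumption for this sketch,
# not taken from the paper).
BOUNDARY = "\u2581"


def word_tokens(sentence: str) -> List[str]:
    """Split on whitespace: the units a word-level system would see."""
    return sentence.split()


def char_tokens(sentence: str) -> List[str]:
    """Split into single characters, with BOUNDARY marking word breaks."""
    return list(BOUNDARY.join(sentence.split()))


def detokenize(chars: List[str]) -> str:
    """Invert char_tokens: join characters and restore spaces."""
    return "".join(chars).replace(BOUNDARY, " ")
```

Character-level segmentation gives a small, closed vocabulary, so a rare or unseen word form in a morphologically rich language such as Arabic or Hebrew is still represented as a sequence of known symbols. The cost is a much longer input sequence than the word-level segmentation of the same sentence.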