CL | AI | LG | May 27, 2022

TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

arXiv:2206.03933v1 | 588 citations | h-index: 20
Originality: Synthesis-oriented
AI Analysis

The paper provides a public toolkit and benchmark for researchers and practitioners working on Arabic machine translation, but the contribution is incremental because it builds on existing models and data.

The authors address machine translation into Modern Standard Arabic (MSA) by developing TURJUMAN, a toolkit built on the AraT5 Transformer model. It supports translation from 20 languages and offers diverse decoding methods that can also be used to generate paraphrases of the translations. They also release a new benchmark dataset, AraOPUS-20.

We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
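
The abstract mentions "a simple semantic similarity method" for selecting parallel data but gives no details here. The sketch below shows one common way such a filter can work: embed both sides of each sentence pair and keep the pair only if the cosine similarity of the embeddings clears a threshold. Everything specific in it is an assumption for illustration, not the authors' method: the `embed` callable (standing in for some multilingual sentence encoder), the function names, and the `0.7` threshold are all hypothetical.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero norm, so empty or
    degenerate embeddings never pass the filter by accident.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],  # hypothetical multilingual sentence encoder
    threshold: float = 0.7,          # hypothetical cutoff, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep (source, target) pairs whose embeddings are similar enough."""
    kept = []
    for source, target in pairs:
        score = cosine_similarity(embed(source), embed(target))
        if score >= threshold:
            kept.append((source, target))
    return kept
```

In practice `embed` would wrap a real multilingual encoder that maps sentences from different languages into a shared vector space. The threshold trades data volume against alignment quality and would need tuning per language pair.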

Code Implementations: 1 repo