CL Sep 16, 2025

Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data

Kurt Micallef, Nizar Habash, Claudia Borg

arXiv:2509.12853v24.91 citations | h-index: 8 | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of low-resource language processing for Maltese. Its contribution is incremental, as it adapts existing techniques to a specific domain.

The paper tackles the problem of limited resources for Maltese natural language processing by exploring cross-lingual augmentation with Arabic data, and demonstrates that this approach can significantly benefit Maltese NLP tasks.

Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.
