CL · Oct 23, 2024

Dialectal and Low-Resource Machine Translation for Aromanian

arXiv:2410.17728v2 · 4 citations · h-index: 7 · Has Code · COLING
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of low-resource and dialectal machine translation for Aromanian, contributing to language preservation efforts, though it is incremental in building on existing methods for under-resourced languages.

This paper tackles machine translation for Aromanian, an endangered language, by creating the largest Aromanian-Romanian parallel corpus to date (79,000 sentence pairs) and developing optimized translation models that achieve competitive results.

This paper presents the process of building a neural machine translation system with support for English, Romanian, and Aromanian - an endangered Eastern Romance language. The primary contribution of this research is twofold: (1) the creation of the most extensive Aromanian-Romanian parallel corpus to date, consisting of 79,000 sentence pairs, and (2) the development and comparative analysis of several machine translation models optimized for Aromanian. To accomplish this, we introduce a suite of auxiliary tools, including a language-agnostic sentence embedding model for text mining and automated evaluation, complemented by a diacritics conversion system for different writing standards. This research brings contributions to both computational linguistics and language preservation efforts by establishing essential resources for a historically under-resourced language. All datasets, trained models, and associated tools are public: https://huggingface.co/aronlp and https://arotranslate.com

