CL | AI | Mar 2, 2024

Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021

arXiv:2403.01196v1 | 682 citations | h-index: 13
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of translating Covid data for Irish speakers, but it is incremental as it applies existing methods to a specific domain.

The study tackled English-to-Irish machine translation of Covid-related data by developing and comparing domain adaptation techniques. Its headline result is that extending an 8k-line in-domain dataset with 5k additional lines improved the BLEU score by 27 points.

Translation models for the specific domain of translating Covid data from English to Irish were developed for the LoResMT 2021 shared task. Domain adaptation techniques, using a Covid-adapted generic 55k corpus from the Directorate General of Translation, were applied. Fine-tuning, mixed fine-tuning, and combined dataset approaches were compared with models trained on an extended in-domain dataset. As part of this study, an English-Irish dataset of Covid-related data from the Health and Education domains was developed. The highest-performing model used a Transformer architecture trained with an extended in-domain Covid dataset. In the context of this study, we have demonstrated that extending an 8k in-domain baseline dataset by just 5k lines improved the BLEU score by 27 points.
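
The data setups compared in the abstract can be sketched as staged training schedules. The sketch below is not the authors' pipeline. It follows one common reading of these terms in the NMT domain adaptation literature: fine-tuning trains on generic data and then continues on in-domain data, mixed fine-tuning continues on a mix of generic and oversampled in-domain data, and the combined approach trains once on the concatenation. The names `oversample` and `build_schedules`, the equal-size oversampling ratio, and the toy corpora are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (English source, Irish target); illustrative type only


def oversample(corpus: List[Pair], target_size: int, rng: random.Random) -> List[Pair]:
    """Repeat a small corpus until it has exactly target_size lines."""
    if not corpus:
        return []
    reps, rem = divmod(target_size, len(corpus))
    return corpus * reps + rng.sample(corpus, rem)


def build_schedules(
    generic: List[Pair], in_domain: List[Pair], seed: int = 0
) -> Dict[str, List[List[Pair]]]:
    """Return, per approach, the ordered list of training stages (one dataset per stage)."""
    rng = random.Random(seed)

    # Assumption: in-domain data is oversampled to match the generic corpus size.
    mixed = generic + oversample(in_domain, len(generic), rng)
    rng.shuffle(mixed)

    combined = generic + in_domain
    rng.shuffle(combined)

    return {
        "fine_tuning": [generic, in_domain],    # stage 1 generic, stage 2 in-domain
        "mixed_fine_tuning": [generic, mixed],  # stage 2 mixes both sources
        "combined": [combined],                 # single stage on the concatenation
        "in_domain_only": [in_domain],          # single stage on in-domain data
    }
```

Each stage would be passed in order to whatever NMT toolkit is in use, with the later stage resuming from the earlier stage's checkpoint. The `in_domain_only` entry corresponds to training directly on the in-domain set, which is the setup the abstract reports as strongest once that set was extended.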

