CL, Sep 4, 2024

Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus

arXiv:2409.02667v1, 2 citations, h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for translators and language professionals to efficiently build custom translation memories for domain-specific tasks like machine translation training, though it is incremental in nature.

The paper tackles the problem of creating domain-specific translation memories for machine translation fine-tuning. It introduces a semi-automatic methodology for compiling parallel corpora and applies it to build the TRENCARD bilingual cardiology corpus, which has approximately 800,000 source words and 50,000 sentences.

This article investigates how translators and other language professionals can create translation memories (TMs) in order to compile domain-specific parallel corpora, which can then be used in scenarios such as machine translation training and fine-tuning, TM leveraging, and large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology that relies primarily on the translation tools translators already use, prioritizing data quality and translator control. This methodology is then used to build a Turkish-to-English cardiology corpus from the bilingual abstracts of Turkish cardiology journals. The resulting corpus, called the TRENCARD Corpus, has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build custom TMs in a reasonable time and use them in tasks that require bilingual data.
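As a rough illustration of what a TM built from aligned abstracts looks like on disk, the sketch below writes sentence pairs to TMX, the common TM exchange format. It is not the paper's tooling, which relies on translation tools rather than scripts. The function name `build_tmx`, the header values, and the sample sentence pair are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute, as ElementTree expects it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="tr", tgt_lang="en"):
    """Return a TMX 1.4 document (as a string) for aligned sentence pairs.

    `pairs` is an iterable of (source, target) strings. Pairs with an
    empty side are skipped, since they carry no translation information.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",      # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        # One translation unit holds one source and one target variant.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Invented sample pair, not taken from the TRENCARD corpus.
sample = [("Kalp yetersizliği sık görülür.", "Heart failure is common.")]
tmx_text = build_tmx(sample)
```

The resulting string can be saved with a `.tmx` extension and should be importable into most CAT tools, though individual tools may expect additional header fields.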

