CL LG Jan 27, 2025

A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain

Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier

arXiv:2501.16533v1 2.7 h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses computational challenges in biomedical machine translation for English-Polish, but it is incremental, since it compares existing filtering methods in a specific domain.

This paper evaluated data filtering techniques (LASER, MUSE, LaBSE) for English-Polish machine translation in the biomedical domain, finding that LASER and MUSE reduced dataset sizes while maintaining or enhancing performance, with LASER providing the most fluent translations.

Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information and therefore pose significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset; the quality of the translations was also assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
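
The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes that each source and target sentence has already been embedded into a shared multilingual vector space by some encoder (LASER, MUSE, or LaBSE would each supply such vectors), and that pairs are ranked by cosine similarity and only the top fraction is kept. The function names, the `keep_fraction` parameter, and the toy vectors are hypothetical; the abstract does not state the paper's actual thresholds or ranking procedure.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Sequence[float]],
    tgt_vecs: List[Sequence[float]],
    keep_fraction: float,
) -> List[Tuple[str, str]]:
    """Keep the `keep_fraction` of sentence pairs with the highest
    source/target cosine similarity, preserving corpus order.

    Hypothetical helper: the embeddings are assumed to be precomputed
    by a multilingual sentence encoder.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not (len(pairs) == len(src_vecs) == len(tgt_vecs)):
        raise ValueError("pairs and embeddings must have the same length")
    if not pairs:
        return []

    # Score every pair, then rank by descending similarity (ties: corpus order).
    scored = [(cosine(s, t), i) for i, (s, t) in enumerate(zip(src_vecs, tgt_vecs))]
    scored.sort(key=lambda item: (-item[0], item[1]))

    n_keep = max(1, math.ceil(keep_fraction * len(pairs)))
    kept_indices = sorted(i for _, i in scored[:n_keep])
    return [pairs[i] for i in kept_indices]


# Toy data: 2-dimensional stand-ins for real sentence embeddings.
toy_pairs = [
    ("heart", "serce"),
    ("liver", "wątroba"),
    ("click here", "spis treści"),   # misaligned pair
    ("kidney", "nerka i wątroba"),   # partially aligned pair
]
toy_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
toy_tgt = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [1.0, 0.0]]

filtered = filter_by_similarity(toy_pairs, toy_src, toy_tgt, keep_fraction=0.5)
```

Varying `keep_fraction` yields training sets of different sizes, which mirrors the idea of fine-tuning on progressively smaller filtered subsets of the corpus.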
