CL · Dec 12, 2025

Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis

Felipe Ribeiro Fujita de Mello, Hideyuki Takada

arXiv:2512.11388v1 · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses data selection for machine translation fine-tuning, but its contribution is incremental: it compares existing selectors rather than introducing a new one.

The study examines whether selecting better fine-tuning data improves the machine translation quality of large language models. It finds that semantic selectors consistently outperform lexical and geometry-based heuristics, and that selected datasets differing by less than 3% can still lead to substantially different model performance.

We investigate the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors (TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection) under controlled training conditions. We observe that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
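To make the idea of a lexical selector concrete, the following is a minimal sketch of TF-IDF-based data selection next to a random baseline, plus a helper that measures how much two selections overlap. It is not the authors' implementation: the abstract does not say how the TF-IDF selector scores sentences, so ranking candidates by cosine similarity to the centroid of a small reference set is an assumption, and all function names and the toy sentences are hypothetical.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Whitespace tokenization is a simplification; real corpora need a proper tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = max(len(toks), 1)
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_tfidf(candidates, reference, k):
    """Indices of the k candidates closest to the reference centroid (assumed scoring rule)."""
    vecs = tfidf_vectors(candidates + reference)
    cand_vecs, ref_vecs = vecs[:len(candidates)], vecs[len(candidates):]
    centroid = Counter()
    for v in ref_vecs:
        for t, w in v.items():
            centroid[t] += w / len(ref_vecs)
    scores = [cosine(v, centroid) for v in cand_vecs]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def select_random(candidates, k, seed=0):
    """Random baseline: k distinct indices drawn with a fixed seed."""
    return random.Random(seed).sample(range(len(candidates)), k)


def overlap_fraction(a, b):
    """Share of the indices in selection a that also appear in selection b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


# Toy data, purely illustrative.
candidates = [
    "the patient received a dose of the drug",
    "stock markets fell sharply today",
    "the drug trial enrolled fifty patients",
    "the football match ended in a draw",
]
reference = ["clinical trial of a new drug", "patients and dose response"]

lexical = select_tfidf(candidates, reference, k=2)
baseline = select_random(candidates, k=2)
print(lexical, baseline, overlap_fraction(lexical, baseline))
```

Comparing selections by index overlap is one simple way to quantify how far two selectors disagree; whether the "less than 3%" figure in the abstract is measured this way is not stated there.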
