CL · Apr 16, 2025

Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

Miguel Moura Ramos, Patrick Fernandes, Sweta Agrawal, André F. T. Martins

arXiv:2504.12140v2 · 12.08 citations · h-index: 20

Originality: Incremental advance

AI Analysis

This work addresses document-level machine translation across multiple languages, offering an incremental improvement through targeted fine-tuning that supports multiple translation paradigms.

The paper tackles the challenge of scaling large language models (LLMs) from sentence-level to document-level machine translation. It fine-tunes LLMs on curated document-level data, introduced as DocBlocks, so that they better capture long-range dependencies and discourse phenomena. The reported experiments show improved translation quality and inference speed compared to prompting and agent-based methods.

Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.
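
The two paradigms named in the abstract, direct document-to-document translation and chunk-level translation with or without surrounding context, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt template, the function names (`split_into_chunks`, `build_prompt`, `translate_document`), and the `translate` callable standing in for an LLM call are all assumptions made for illustration, and the paper's actual instruction formats may differ.

```python
from typing import Callable, List, Optional


def split_into_chunks(sentences: List[str], chunk_size: int) -> List[List[str]]:
    """Group consecutive sentences into chunks of at most chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


def build_prompt(source: str, src_lang: str, tgt_lang: str, context: str = "") -> str:
    """Hypothetical instruction template (an assumption, not the paper's prompt)."""
    parts = []
    if context:
        # Optional surrounding context: here, the target-side text produced so far.
        parts.append(f"Context (previously translated {tgt_lang} text):\n{context}\n")
    parts.append(f"Translate the following {src_lang} text into {tgt_lang}.")
    parts.append(f"{src_lang}: {source}")
    parts.append(f"{tgt_lang}:")
    return "\n".join(parts)


def translate_document(
    sentences: List[str],
    src_lang: str,
    tgt_lang: str,
    translate: Callable[[str], str],  # stand-in for a call to a translation LLM
    chunk_size: Optional[int] = None,
    use_context: bool = True,
) -> str:
    """Translate a document in one pass, or chunk by chunk when chunk_size is set."""
    if chunk_size is None:
        # Document-to-document: the whole document goes into a single prompt.
        return translate(build_prompt(" ".join(sentences), src_lang, tgt_lang))

    outputs: List[str] = []
    for chunk in split_into_chunks(sentences, chunk_size):
        # Chunk-level: optionally condition on everything translated so far.
        context = " ".join(outputs) if use_context else ""
        prompt = build_prompt(" ".join(chunk), src_lang, tgt_lang, context)
        outputs.append(translate(prompt))
    return " ".join(outputs)
```

In this sketch, leaving `chunk_size` unset sends the whole document in one prompt, while setting it translates chunk by chunk; `use_context=False` drops the surrounding context so that each chunk is translated in isolation. Passing all prior output as context is the simplest choice and grows the prompt with document length, so a real system would likely cap it.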
