CL
Aug 18, 2025

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

arXiv:2508.13079v2
5 citations
h-index: 13
Has Code
Proceedings of the Tenth Conference on Machine Translation
Originality: Incremental advance
AI Analysis

DocHPLT provides essential infrastructure for advancing multilingual document-level translation, benefiting global communities and researchers in long-context modeling.

The authors address the scarcity of document-level machine translation datasets by creating DocHPLT, the largest publicly available dataset of its kind, with 124 million aligned document pairs across 50 languages paired with English. They show that LLMs fine-tuned on it substantially outperform off-the-shelf instruction-tuned baselines, especially for under-resourced languages.

Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
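The pivoted alignments mentioned in the abstract can be illustrated with a minimal sketch: two non-English documents aligned to the same English document are treated as a pair. The input schema here (triples of language code, English document ID, document ID), the function name `pivot_pairs`, and the assumption of at most one document per language per English document are all hypothetical; the released dataset's actual format and the authors' own pivoting procedure may differ.

```python
from collections import defaultdict
from itertools import permutations


def pivot_pairs(english_aligned):
    """Derive non-English document pairs by pivoting through English.

    english_aligned: iterable of (lang, en_doc_id, doc_id) triples.
    This schema is an assumption for illustration, not the dataset's format.
    Returns a dict mapping (src_lang, tgt_lang) to a list of
    (src_doc_id, tgt_doc_id) tuples.
    """
    # Group documents by the English document they are aligned to.
    # Assumes one document per language per English document; a later
    # triple for the same (en_doc_id, lang) overwrites an earlier one.
    by_english = defaultdict(dict)
    for lang, en_doc_id, doc_id in english_aligned:
        by_english[en_doc_id][lang] = doc_id

    # Every ordered pair of distinct languages sharing an English
    # document yields one pivoted (directed) document pair.
    pivoted = defaultdict(list)
    for docs in by_english.values():
        for src, tgt in permutations(sorted(docs), 2):
            pivoted[(src, tgt)].append((docs[src], docs[tgt]))
    return dict(pivoted)
```

With this sketch, an English document aligned to both a Finnish and a Swahili document contributes one Finnish-Swahili pair and one Swahili-Finnish pair, while an English document aligned to only one other language contributes nothing.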

