CL | AI | Nov 14, 2025

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

arXiv:2511.10984v1 | 2 citations | h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses the need for better evaluation of machine translation in expert domains such as scholarly communication. The advance is incremental, since it builds on existing benchmarking efforts.

The authors address the inadequate evaluation of discourse-level translation in expert domains by introducing DiscoX, a benchmark of 200 professionally curated texts, and Metric-S, a reference-free evaluation system. Metric-S shows strong consistency with human judgments and outperforms existing metrics, and the experiments reveal that even advanced LLMs still trail human experts.

The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.

