CL · Oct 20, 2025

Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models

arXiv:2510.18077v1 · 1 citation · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of context-aware translation for NLP practitioners, but its contribution is incremental: it applies an existing reasoning technique to a specific translation benchmark.

The paper tackles the translation of texts with inter-sentential dependencies using large language models. It finds that chain-of-thought reasoning prompts bring the best models to about 90% accuracy on a discrimination task and to COMET scores of about 92% on a generation task, with GPT-4, GPT-4o, and Phi performing best.

This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a "wise get wiser" effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
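
The two quantities mentioned in the abstract, accuracy on the discrimination task and the "wise get wiser" correlation, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, Pearson correlation is assumed (the abstract only says "positively correlated"), and the numbers in the usage example are toy values, not results from the paper.

```python
import math


def accuracy(predicted, gold):
    """Fraction of sentence pairs where the model picked the correct translation."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def reasoning_gain_correlation(baseline, with_reasoning):
    """Correlate each model's no-reasoning score with its gain from reasoning.

    A positive value is what a "wise get wiser" effect would look like:
    models that score higher without reasoning also gain more from it.
    """
    gains = [r - b for b, r in zip(baseline, with_reasoning)]
    return pearson(baseline, gains)


# Toy values only (hypothetical, NOT taken from the paper).
toy_baseline = [0.50, 0.60, 0.70]
toy_with_reasoning = [0.52, 0.66, 0.80]
toy_r = reasoning_gain_correlation(toy_baseline, toy_with_reasoning)
```

In this toy setup the gains grow with the baseline scores, so `toy_r` is close to 1; with real per-model scores the value would simply indicate how strongly the gain from reasoning tracks baseline quality.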

