CL · Mar 5, 2025

The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation

arXiv:2503.03308v1 · 1009 citations · h-index: 58 · Has Code · Findings
Originality: Incremental advance
AI Analysis

This work addresses whether the output of machine translation systems is consistent with common sense. Its contribution is incremental: it evaluates translation models rather than improving them directly.

The authors evaluate commonsense reasoning in neural machine translation with a test suite of 1,200 triples covering lexical and syntactic ambiguities. They find that neural machine translation performs poorly, with a reasoning accuracy of 60.1% and a reasoning consistency of 31%.

Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different commonsense types. Language models pretrained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that affect this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (60.1%) and reasoning consistency (31%). The built commonsense test suite is available at https://github.com/tjunlp-lab/CommonMT.
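
The abstract reports two metrics over contrastive triples without defining them here, so the following is only a minimal sketch under stated assumptions, not the authors' evaluation code. It assumes that a triple counts as correct when a model scores the commonsense translation above the contrastive one, and that consistency is the share of triple groups in which every member is correct. The names `Triple`, `pair_id`, `Scorer`, and `toy_score` are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Triple:
    source: str       # source sentence containing the ambiguity
    correct: str      # translation that agrees with common sense
    contrastive: str  # translation that violates common sense
    pair_id: int      # hypothetical field: groups triples sharing one ambiguity


# Hypothetical scoring interface: (source, translation) -> model score,
# e.g. a log-probability. Higher means the model prefers the translation.
Scorer = Callable[[str, str], float]


def is_correct(triple: Triple, score: Scorer) -> bool:
    """Assumed criterion: the correct translation must outscore the contrastive one."""
    return score(triple.source, triple.correct) > score(triple.source, triple.contrastive)


def reasoning_accuracy(triples: Sequence[Triple], score: Scorer) -> float:
    """Fraction of triples on which the model prefers the correct translation."""
    if not triples:
        raise ValueError("need at least one triple")
    return sum(is_correct(t, score) for t in triples) / len(triples)


def reasoning_consistency(triples: Sequence[Triple], score: Scorer) -> float:
    """Assumed definition: fraction of groups in which every triple is correct."""
    if not triples:
        raise ValueError("need at least one triple")
    by_group: Dict[int, List[bool]] = {}
    for t in triples:
        by_group.setdefault(t.pair_id, []).append(is_correct(t, score))
    return sum(all(results) for results in by_group.values()) / len(by_group)


# Toy data with placeholder strings, not items from the actual test suite.
toy_triples = [
    Triple("src-1", "good-1", "bad-1", pair_id=0),
    Triple("src-2", "good-2", "bad-2", pair_id=0),
    Triple("src-3", "good-3", "bad-3", pair_id=1),
    Triple("src-4", "good-4", "bad-4", pair_id=1),
]

# Stand-in for a real model: it prefers the wrong translation for src-4 only.
preferred = {"good-1", "good-2", "good-3", "bad-4"}


def toy_score(source: str, translation: str) -> float:
    return 1.0 if translation in preferred else 0.0


print(reasoning_accuracy(toy_triples, toy_score))     # 0.75
print(reasoning_consistency(toy_triples, toy_score))  # 0.5
```

Under these assumptions consistency can never exceed accuracy, because one wrong triple fails its whole group. That matches the direction of the gap between the reported 60.1% and 31%, though the paper's exact definitions may differ.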

Code Implementations: 1 repo