CL, Jan 20

On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation

arXiv:2601.13729v1
h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses evaluation problems for non-deterministic machine translation systems. It is an incremental advance, relevant to NLP researchers and practitioners.

The study identifies temperature-constrained non-deterministic machine translation (ND-MT) as a distinct phenomenon with potential to address the multi-modality issue. Under temperature constraints, ND-MT provides higher-quality candidates than deterministic MT (D-MT). It also introduces an evaluation challenge: the lowest-quality candidate generated by ND-MT determines system rankings across metrics and sampling sizes.

In recent years, the non-deterministic properties of language models have garnered considerable attention and have shown a significant influence on real-world applications. However, such properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. Additionally, we demonstrate that ND-MT exhibits significant potential in addressing the multi-modality issue that has long challenged MT research and provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. However, ND-MT introduces new challenges in evaluating system performance. Specifically, the evaluation framework designed for D-MT fails to yield consistent evaluation results when applied to ND-MT. We further investigate this emerging challenge by evaluating five state-of-the-art ND-MT systems across three open datasets using both lexical-based and semantic-based metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by ND-MT consistently determines the overall system ranking across different sampling sizes for all reasonable metrics. Furthermore, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT.
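To make the ranking question concrete, here is a minimal sketch of how a system-level ranking can depend on which sampled candidate stands in for each source sentence. It is an illustration under stated assumptions, not the paper's evaluation procedure or its ExpectoSample strategy: the system names, the scores, and the helpers `aggregate` and `rank_systems` are all hypothetical, and the scores are made-up stand-ins for whatever lexical or semantic metric would be used.

```python
from statistics import mean

# Ways to pick one representative score from the sampled candidates
# of a single source sentence (hypothetical choice of reducers).
REDUCERS = {"min": min, "mean": mean, "max": max}


def aggregate(scores, how):
    """Collapse per-source candidate scores into one system-level score.

    scores[i][j] is the metric score of candidate j for source sentence i.
    Each sentence is reduced with `how`, then sentences are averaged.
    """
    reduce = REDUCERS[how]
    return mean(reduce(candidates) for candidates in scores)


def rank_systems(system_scores, how):
    """Return system names ordered from best to worst under `how`."""
    return sorted(
        system_scores,
        key=lambda name: aggregate(system_scores[name], how),
        reverse=True,
    )


# Made-up toy scores: 2 source sentences, 3 sampled candidates each.
# system_a has high peaks but a weak worst candidate; system_b is steadier.
toy_scores = {
    "system_a": [[0.90, 0.60, 0.88], [0.85, 0.55, 0.80]],
    "system_b": [[0.82, 0.78, 0.80], [0.79, 0.75, 0.81]],
}

for how in ("min", "mean", "max"):
    print(how, rank_systems(toy_scores, how))
```

In this toy data the two systems swap places depending on whether the worst or the best candidate represents each sentence, which is why a framework that assumes a single deterministic output per source can give inconsistent results when applied to sampled outputs.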
