AI · Jan 29

Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

Václav Javorek, Tomáš Železný, Alessa Carbo, Marek Hrúz, Ivan Gruber

arXiv:2601.21128v1 · 2.4 · h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses unreliable automatic evaluation in SLT, a problem that affects both researchers and developers. The advance is incremental because it builds on existing metrics and methods.

The paper tackles the limitation of single-reference translations in Sign Language Translation (SLT) by using Large Language Models to generate paraphrased variants of each reference. It finds that incorporating paraphrases during evaluation raises automatic scores and aligns them better with human judgments. On this basis it introduces BLEUpara, which correlates more strongly with perceived translation quality than single-reference BLEU.

Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, generation and evaluation code to support reproducible and more reliable evaluation of SLT systems.
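The abstract does not spell out how BLEUpara is computed, so the following is only a minimal sketch of the general idea of scoring one hypothesis against several references, in the style of standard multi-reference BLEU. The function name `multi_reference_bleu`, whitespace tokenization, add-one smoothing for higher-order n-grams, and the example sentences are all assumptions made for illustration; they are not the authors' implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_reference_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU against several references (hypothetical helper).

    Assumptions: whitespace tokenization, add-one smoothing for n > 1,
    and a brevity penalty based on the closest reference length.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        # For each n-gram, keep the highest count seen in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # Clip hypothesis counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if clipped == 0:
                return 0.0  # no word overlap with any reference
            precision = clipped / total
        else:
            precision = (clipped + 1) / (total + 1)  # add-one smoothing
        log_precision_sum += math.log(precision) / max_n

    # Brevity penalty uses the reference length closest to the hypothesis;
    # ties are resolved in favour of the shorter reference.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_precision_sum)


hypothesis = "the cat sat on the mat"
original = ["a cat was sitting on the mat"]
with_paraphrase = original + ["the cat sat on the mat"]

single_score = multi_reference_bleu(hypothesis, original)
multi_score = multi_reference_bleu(hypothesis, with_paraphrase)
print(f"single reference: {single_score:.3f}")
print(f"with paraphrase:  {multi_score:.3f}")
```

Because n-gram matches are clipped against the maximum count over all references, adding a valid paraphrase can only keep or raise the n-gram precisions, which is consistent with the higher automatic scores reported when paraphrases are used during evaluation.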
