CL · AI · Sep 4, 2025

SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani

arXiv:2509.03791v1 · 4 citations · h-index: 17 · RANLP

Originality: Highly original

AI Analysis

This paper addresses the ambiguity of back-translation evaluation for sign language generation, which fails to capture multimodal aspects of signing and confounds errors from the generation system with errors from the translation system.

The paper tackles the problem of evaluating sign language generation by proposing SiLVERScore, a semantically-aware embedding-based metric that assesses generation in a joint embedding space, achieving near-perfect discrimination with ROC AUC = 0.99 and overlap < 7% on PHOENIX-14T and CSL-Daily datasets.

Evaluating sign language generation is often done through back-translation, where generated signs are first translated back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or from the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
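To make the "correct versus random pairs" evaluation concrete, here is a minimal sketch of how an embedding-based score and its ROC AUC could be computed. It is not the authors' implementation: the abstract does not name the encoders or the similarity function, so cosine similarity is an assumption, the embeddings are synthetic stand-ins drawn from a random generator, and the helper names `cosine_scores` and `roc_auc` are hypothetical.

```python
import numpy as np


def cosine_scores(sign_emb, text_emb):
    """Row-wise cosine similarity between paired sign and text embeddings.

    Assumption: both inputs are (n, d) arrays already projected into the
    same joint embedding space by some pair of encoders.
    """
    a = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def roc_auc(pos, neg):
    """ROC AUC as the probability that a correct-pair score exceeds a
    random-pair score, counting ties as one half."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


# Synthetic stand-ins for encoder outputs (illustration only, not real data).
rng = np.random.default_rng(0)
n, d = 200, 64
text_emb = rng.normal(size=(n, d))
# A "good" generated sign lands near its reference text in the joint space.
sign_emb = text_emb + 0.5 * rng.normal(size=(n, d))

# Correct pairs: each sign scored against its own reference text.
correct = cosine_scores(sign_emb, text_emb)
# Random pairs: each sign scored against a different reference
# (shifting the rows by one guarantees every pair is mismatched).
random_pairs = cosine_scores(sign_emb, np.roll(text_emb, 1, axis=0))

auc = roc_auc(correct, random_pairs)
print(f"ROC AUC on synthetic pairs: {auc:.3f}")
```

An AUC close to 1 means the score almost always ranks a correct pair above a random one, which is the sense of "near-perfect discrimination" in the abstract. The overlap figure is not reproduced here because the abstract does not define how it is measured.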
