CV, Jun 1

Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument

arXiv:2606.01643

AI Analysis

For researchers in sign language production, this work diagnoses a critical failure in current evaluation and generation methods, highlighting the need for better metrics and larger paired datasets.

The paper shows that existing metrics for Sign Language Production (SLP) can improve while the generated motion fails to faithfully represent sign language gestures. It proposes three diagnostic levels (initial-pose conditioning, output diversity, target faithfulness) and reports that faithfulness is never attained on How2Sign, that FID is uncorrelated with faithfulness, and that the size of the sentence-level paired dataset is the bottleneck.

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and a back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: (τ1) initial-pose conditioning, (τ2) output diversity, and (τ3) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that τ3 faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We also show that a favorable τ3 can be attained on the isolated-gloss dataset ASL3DWord, which isolates the size of the sentence-level paired dataset as the bottleneck.
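The abstract does not give the exact formulas for the three levels, so the following is only a rough sketch of what a pairwise-distance ratio over latent vectors could look like. The function names, the Euclidean distance, the matched/mismatched construction, and the synthetic latents standing in for MoAE encodings are all assumptions made for illustration, not the paper's definitions.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def faithfulness_ratio(gen, tgt):
    """Hypothetical faithfulness-style ratio (not the paper's formula).

    Mean distance from each generated latent to its own target, divided by
    the mean distance to the targets of other prompts. Values well below 1
    suggest the output tracks its target; values near 1 suggest it does not.
    """
    d = pairwise_dist(gen, tgt)
    n = d.shape[0]
    matched = np.trace(d) / n
    mismatched = (d.sum() - np.trace(d)) / (n * (n - 1))
    return matched / mismatched


def diversity_ratio(gen, tgt):
    """Hypothetical diversity-style ratio (not the paper's formula).

    Mean pairwise spread of generated latents relative to the spread of
    target latents. Values near 0 indicate near-identical outputs.
    """
    off_diag = ~np.eye(gen.shape[0], dtype=bool)
    gen_spread = pairwise_dist(gen, gen)[off_diag].mean()
    tgt_spread = pairwise_dist(tgt, tgt)[off_diag].mean()
    return gen_spread / tgt_spread


# Synthetic stand-ins for frozen-autoencoder latents (64 prompts, 16 dims).
rng = np.random.default_rng(0)
targets = rng.normal(size=(64, 16))
# A generator that tracks its target closely.
faithful = targets + 0.1 * rng.normal(size=targets.shape)
# A generator that ignores the text and emits nearly the same motion.
collapsed = targets.mean(axis=0) + 0.01 * rng.normal(size=targets.shape)

print("faithful :", faithfulness_ratio(faithful, targets), diversity_ratio(faithful, targets))
print("collapsed:", faithfulness_ratio(collapsed, targets), diversity_ratio(collapsed, targets))
```

In this toy setup the collapsed generator scores a faithfulness-style ratio close to 1 and a diversity-style ratio close to 0, which is the kind of failure a distribution-level score such as FID is not designed to expose on its own.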
