CV, Aug 30, 2023

SignDiff: Diffusion Model for American Sign Language Production

arXiv:2308.16082v4, 26 citations, h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the challenge of producing accurate, high-quality ASL videos from text, which matters for accessibility and communication tools for the deaf and hard-of-hearing community. The contribution appears incremental, consisting of new modules and a new loss function.

The authors tackled the problem of generating American Sign Language (ASL) skeletal pose videos from text input. They report a first baseline on How2Sign, with BLEU-4 scores of 17.19 and 12.85 on the dev and test sets, state-of-the-art results on PHOENIX14T, and a 10 percentage point improvement in SSIM for image quality.

In this paper, we propose a dual-condition diffusion pre-training model named SignDiff that can generate human sign language speakers from a skeleton pose. SignDiff has a novel Frame Reinforcement Network called FR-Net, similar to dense human pose estimation work, which enhances the correspondence between text lexical symbols and sign language dense pose frames and reduces the occurrence of multiple fingers in the diffusion model. In addition, we propose a new method for American Sign Language Production (ASLP), which can generate ASL skeletal pose videos from text input. It integrates two new improved modules and a new loss function to improve the accuracy and quality of sign language skeletal posture and to enhance the ability of the model to train on large-scale data. We propose the first baseline for ASL production and report BLEU-4 scores of 17.19 and 12.85 on the How2Sign dev and test sets. We also evaluated our model on the previous mainstream dataset PHOENIX14T, where it achieved SOTA results. In addition, our image quality exceeds all previous results by 10 percentage points in terms of SSIM.
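
For readers unfamiliar with the headline metric, the sketch below shows how a corpus-level BLEU-4 score can be computed from tokenized text. It is a minimal illustration, not the paper's evaluation code. It assumes one reference per hypothesis and no smoothing, and the function names `ngram_counts` and `corpus_bleu4` are hypothetical. Sign language production work typically computes BLEU on text back-translated from the generated poses; whether this paper follows that protocol exactly is an assumption here.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(references, hypotheses):
    """Unsmoothed corpus-level BLEU-4, one reference per hypothesis.

    Returns a value in [0, 1]; papers usually report it multiplied by 100.
    """
    matched = [0] * 4  # clipped n-gram matches for orders 1..4
    total = [0] * 4    # hypothesis n-gram counts for orders 1..4
    ref_len = 0
    hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, 5):
            ref_ngrams = ngram_counts(ref, n)
            hyp_ngrams = ngram_counts(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, a single order with no matches makes the score zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the four modified precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Illustrative tokens only, not data from the paper.
refs = [["i", "want", "to", "learn", "sign", "language"]]
hyps = [["i", "want", "to", "learn", "sign"]]
score = corpus_bleu4(refs, hyps)
```

On this scale, the reported 17.19 and 12.85 correspond to raw scores of roughly 0.17 and 0.13. Published results are normally produced with an established tool rather than a hand-written function, so small differences in tokenization or smoothing can shift the numbers.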

