CV, Mar 11, 2021

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Ben Saunders, Necati Cihan Camgoz, Richard Bowden

arXiv:2103.06982v1 20.0117 citations

Originality Highly original

AI Analysis

This work addresses the need for more natural and expressive sign language production to improve communication accessibility for the Deaf community, representing a novel advancement rather than an incremental improvement.

The paper tackles the problem of generating continuous 3D multi-channel sign language from spoken language sentences, achieving state-of-the-art results on the PHOENIX14T dataset with benchmark quantitative metrics and positive user evaluations from the Deaf community.

Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate. Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production. In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation to produce realistic and expressive sign pose sequences. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions.
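
The counter decoding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `counter_decode` and `toy_step`, the `max_frames` cap, and the convention that the counter runs from 0 to 1 with 1.0 marking the end of the sequence are all assumptions made for illustration. A trained decoder would take the place of `toy_step`.

```python
from typing import Callable, List, Tuple

Pose = List[float]  # flattened 3D joint coordinates for one frame (assumed layout)


def counter_decode(
    step: Callable[[List[Pose]], Tuple[Pose, float]],
    first_pose: Pose,
    max_frames: int = 300,
) -> List[Pose]:
    """Generate frames autoregressively until the predicted counter reaches 1.0.

    `step` is a hypothetical decoder call: given all frames so far, it returns
    the next pose and a progress counter (assumed to lie in [0, 1]).
    """
    frames = [first_pose]
    for _ in range(max_frames - 1):
        pose, counter = step(frames)
        frames.append(pose)
        if counter >= 1.0:  # the counter doubles as the end-of-sequence signal
            break
    return frames


def toy_step(frames: List[Pose]) -> Tuple[Pose, float]:
    """Stand-in for a trained decoder: progress rises by 0.25 per frame."""
    counter = len(frames) / 4
    return [counter, counter, counter], counter


sequence = counter_decode(toy_step, [0.0, 0.0, 0.0])
```

Because the stopping condition is read from a continuous value rather than a discrete end-of-sequence token, the same loop works for pose outputs that have no vocabulary. The `max_frames` cap is only a safeguard in this sketch against a counter that never reaches 1.0.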
