SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
This work addresses the challenge of accurate sign language translation for inclusive communication by integrating non-manual cues, representing an incremental improvement over existing methods.
The paper incorporates mouthing cues, which are often overlooked, to disambiguate visually similar signs. On the PHOENIX14T dataset in the gloss-free setting, this improves BLEU-4 from 24.32 to 24.71 (+0.39) and ROUGE from 46.57 to 48.38 (+1.81) over the previous state of the art.
Sign language translation (SLT) aims to translate sign language videos into natural language, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches focus on manual signals (hand gestures) and tend to overlook non-manual cues such as mouthing. Yet mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework that improves the accuracy of sign language translation by fusing manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across the sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T in the gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38.
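
To make the idea of multi-level contrastive alignment more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the symmetric InfoNCE loss is a standard contrastive objective swapped in for illustration, and the function names (`info_nce`, `hierarchical_loss`), the temperature of 0.07, and the loss weights are all assumptions that do not appear in the abstract. Each row of a batch is treated as one sample, and row i of one modality is assumed to be the positive match for row i of the other.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `anchors` pairs with row i of `positives`.

    The temperature value is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    # Cosine similarity between every anchor and every candidate, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_denom = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_denom - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def hierarchical_loss(sign, lip, visual, text, w_sign_lip=1.0, w_visual_text=1.0):
    """Weighted sum of a sign-lip term and a visual-text term (weights are hypothetical)."""
    return w_sign_lip * info_nce(sign, lip) + w_visual_text * info_nce(visual, text)


# Toy features: two samples, two dimensions per modality.
sign = [[1.0, 0.0], [0.0, 1.0]]
lip = [[0.9, 0.1], [0.1, 0.9]]
aligned = info_nce(sign, lip)
mismatched = info_nce(sign, lip[::-1])  # lip rows swapped, so the pairs no longer match
```

In this toy setup the loss for correctly paired sign and lip features is much smaller than the loss for swapped pairs, which is the behaviour a contrastive alignment objective is meant to reward. How the paper actually defines its alignment levels, fuses features, and weights the terms is not stated in the abstract.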