CV | Dec 15, 2025

USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

arXiv:2512.13415v23.6 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of accurately recognizing sign language from videos for accessibility applications, representing an incremental improvement over existing methods.

The paper tackles continuous sign language recognition by proposing the USTM framework, which achieves state-of-the-art performance on the PHOENIX14, PHOENIX14T, and CSL-Daily benchmark datasets, outperforming RGB-based and multi-modal approaches.

Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques struggle to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM
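To make the idea of a temporal adapter with positional embeddings concrete, the following is a minimal sketch and not the authors' TAPE module. It assumes per-frame feature vectors have already been produced by a spatial backbone, adds standard sinusoidal embeddings over the frame index, and applies a bottleneck adapter (down-projection, a small temporal convolution, ReLU, up-projection) with a residual connection. The function names, the sinusoidal scheme, the kernel, and the bottleneck layout are all assumptions made for illustration. It is written in dependency-free Python, whereas a real implementation would use a deep learning framework.

```python
import math


def sinusoidal_positions(num_frames, dim):
    """Sinusoidal embeddings, one vector per frame index (assumed scheme)."""
    table = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            angle = t / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def temporal_adapter(frames, w_down, w_up, kernel):
    """Hypothetical adapter: frames is T x D, w_down is r x D, w_up is D x r.

    kernel is an odd-length list of weights shared across channels and
    applied along the time axis with zero padding.
    """
    num_frames, dim = len(frames), len(frames[0])
    pos = sinusoidal_positions(num_frames, dim)
    # Inject frame-order information before the bottleneck.
    x = [[f + p for f, p in zip(fr, pr)] for fr, pr in zip(frames, pos)]
    hidden = [matvec(w_down, v) for v in x]  # T x r
    rank, half = len(w_down), len(kernel) // 2
    mixed = []
    for t in range(num_frames):
        acc = [0.0] * rank
        for k, wk in enumerate(kernel):
            s = t + k - half  # neighbouring frame index
            if 0 <= s < num_frames:
                for j in range(rank):
                    acc[j] += wk * hidden[s][j]
        mixed.append([max(0.0, a) for a in acc])  # ReLU
    update = [matvec(w_up, v) for v in mixed]  # T x D
    # Residual connection: the backbone features pass through unchanged
    # whenever the up-projection is zero.
    return [[f + u for f, u in zip(fr, ur)] for fr, ur in zip(frames, update)]
```

One design choice worth noting in this sketch: if `w_up` is initialized to zeros, the adapter returns its input unchanged, a common way to add adapters to a pretrained backbone without disturbing its features at the start of training. Whether USTM does this is not stated in the abstract.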

