CV · CL · Mar 17

STARK: Spatio-Temporal Attention for Representation of Keypoints for Continuous Sign Language Recognition

arXiv:2603.16163 · 15.8 · h-index: 1
AI Analysis

This work addresses efficiency in CSLR for deaf communities, but it is incremental as it builds on existing keypoint-based methods with a focus on parameter reduction.

The paper tackles the problem of high parameter count in keypoint-based continuous sign language recognition by introducing a unified spatio-temporal attention network, which reduces parameters by 70-80% while achieving comparable performance on the Phoenix-14T dataset.

Continuous Sign Language Recognition (CSLR) is a crucial task for understanding the languages of deaf communities. Contemporary keypoint-based approaches typically rely on spatio-temporal encoding, where spatial interactions among keypoints are modeled using Graph Convolutional Networks or attention mechanisms, while temporal dynamics are captured using 1D convolutional networks. However, such designs often introduce a large number of parameters in both the encoder and the decoder. This paper introduces a unified spatio-temporal attention network that computes attention scores both spatially (across keypoints) and temporally (within local windows), and aggregates features to produce a local context-aware spatio-temporal representation. The proposed encoder contains approximately 70-80% fewer parameters than existing state-of-the-art models while achieving comparable performance to keypoint-based methods on the Phoenix-14T dataset.
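The abstract's core idea, attention computed across keypoints and within a local temporal window and then aggregated, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `local_spatiotemporal_attention`, the single-head design, the projection matrices `w_q`, `w_k`, `w_v`, and the `window` parameter are all assumptions made for illustration. Here each keypoint in frame `t` attends jointly to every keypoint in frames `t - window` through `t + window`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_spatiotemporal_attention(x, w_q, w_k, w_v, window=1):
    """Hypothetical single-head attention over keypoints in a local time window.

    x:   (T, K, D) array of T frames, K keypoints, D features per keypoint.
    w_*: (D, Dh) projection matrices (assumed, not taken from the paper).
    Returns a (T, K, Dh) array of context-aware keypoint features.
    """
    T, K, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (T, K, Dh)
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.empty_like(v)
    for t in range(T):
        # Frames inside the local temporal window, clipped at sequence edges.
        lo, hi = max(0, t - window), min(T, t + window + 1)
        k_win = k[lo:hi].reshape(-1, k.shape[-1])  # ((hi - lo) * K, Dh)
        v_win = v[lo:hi].reshape(-1, v.shape[-1])
        # One softmax over all (frame, keypoint) pairs in the window, so
        # spatial and temporal attention share a single score matrix.
        attn = softmax((q[t] @ k_win.T) * scale, axis=-1)
        out[t] = attn @ v_win                      # aggregate features
    return out


rng = np.random.default_rng(0)
T, K, D, Dh = 6, 5, 8, 4
x = rng.normal(size=(T, K, D))
w_q, w_k, w_v = (rng.normal(size=(D, Dh)) for _ in range(3))
y = local_spatiotemporal_attention(x, w_q, w_k, w_v, window=1)
print(y.shape)  # (6, 5, 4)
```

In this sketch the only learned parameters would be the three projection matrices, and no separate temporal convolution is stacked on the spatial module. Whether the paper obtains its parameter reduction this way is not stated in the abstract.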
