CVMay 20, 2024

Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining

arXiv:2405.12018v16 citationsh-index: 35
Originality Incremental advance
AI Analysis

This work addresses sign language recognition for accessibility applications, but it is incremental as it adapts an existing model to a new domain.

The authors tackled continuous sign language recognition by adapting the Conformer model from speech recognition to a vision-based task, introducing Cross-Modal Relative Attention and unsupervised pretraining, achieving state-of-the-art performance on PHOENIX-2014 and PHOENIX-2014T datasets.

Conventional Deep Learning frameworks for continuous sign language recognition (CSLR) are comprised of a single or multi-modal feature extractor, a sequence-learning module, and a decoder for outputting the glosses. The sequence learning module is a crucial part wherein transformers have demonstrated their efficacy in the sequence-to-sequence tasks. Analyzing the research progress in the field of Natural Language Processing and Speech Recognition, a rapid introduction of various transformer variants is observed. However, in the realm of sign language, experimentation in the sequence learning component is limited. In this work, the state-of-the-art Conformer model for Speech Recognition is adapted for CSLR and the proposed model is termed ConSignformer. This marks the first instance of employing Conformer for a vision-based task. ConSignformer has bimodal pipeline of CNN as feature extractor and Conformer for sequence learning. For improved context learning we also introduce Cross-Modal Relative Attention (CMRA). By incorporating CMRA into the model, it becomes more adept at learning and utilizing complex relationships within the data. To further enhance the Conformer model, unsupervised pretraining called Regressional Feature Extraction is conducted on a curated sign language dataset. The pretrained Conformer is then fine-tuned for the downstream recognition task. The experimental results confirm the effectiveness of the adopted pretraining strategy and demonstrate how CMRA contributes to the recognition process. Remarkably, leveraging a Conformer-based backbone, our model achieves state-of-the-art performance on the benchmark datasets: PHOENIX-2014 and PHOENIX-2014T.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes