CVJun 8, 2025

Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition

Siyuan Jing, Guangxue Wang, Haoyang Zhai, Qin Tao, Jun Yang, Bing Wang, Peng Jin

arXiv:2506.06966v13.62 citationsh-index: 2

Originality Synthesis-oriented

AI Analysis

This work addresses the challenge of hand occlusions in sign language recognition for the Chinese deaf community by providing a new dataset and baseline method, though it is incremental in nature.

The paper tackled the problem of isolated sign language recognition (ISLR) by introducing a dual-view dataset covering the full Chinese national sign language vocabulary and a CNN-Transformer hybrid network with a fusion strategy, achieving significant performance improvements in experiments.

Due to the emergence of many sign language datasets, isolated sign language recognition (ISLR) has made significant progress in recent years. In addition, the development of various advanced deep neural networks is another reason for this breakthrough. However, challenges remain in applying the technique in the real world. First, existing sign language datasets do not cover the whole sign vocabulary. Second, most of the sign language datasets provide only single view RGB videos, which makes it difficult to handle hand occlusions when performing ISLR. To fill this gap, this paper presents a dual-view sign language dataset for ISLR named NationalCSL-DP, which fully covers the Chinese national sign language vocabulary. The dataset consists of 134140 sign videos recorded by ten signers with respect to two vertical views, namely, the front side and the left side. Furthermore, a CNN transformer network is also proposed as a strong baseline and an extremely simple but effective fusion strategy for prediction. Extensive experiments were conducted to prove the effectiveness of the datasets as well as the baseline. The results show that the proposed fusion strategy can significantly increase the performance of the ISLR, but it is not easy for the sequence-to-sequence model, regardless of whether the early-fusion or late-fusion strategy is applied, to learn the complementary features from the sign videos of two vertical views.

View on arXiv PDF

Similar