CV · MM · Jun 26, 2025

Hierarchical Sub-action Tree for Continuous Sign Language Recognition

arXiv:2506.20947v1 · h-index: 3 · ICME
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck of insufficient training data and alignment issues in CSLR, which is important for improving accessibility tools for deaf and hard-of-hearing communities, but it appears incremental as it builds on existing cross-modal solutions.

The paper tackles the problem of continuous sign language recognition (CSLR) by proposing a Hierarchical Sub-action Tree (HST-CSLR) to better combine gloss knowledge with visual representation learning, resulting in improved performance demonstrated on four datasets including PHOENIX-2014 and CSL-Daily.

Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR. To address this, some works have developed cross-modal solutions that align the visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing the knowledge the glosses carry. In this paper, we propose HST-CSLR, a method built on a Hierarchical Sub-action Tree (HST), to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST to represent the textual information, align the visual and textual modalities step by step, and benefit from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of HST-CSLR.
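The abstract does not spell out how step-by-step alignment over a tree reduces computation, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a greedy top-down descent with cosine similarity, a toy two-level tree, and hand-made embeddings; the names `Node`, `greedy_descent`, `flat_search`, and the `g*_s*` labels are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node holding a (hypothetical) text embedding."""
    label: str
    embedding: list[float]
    children: list["Node"] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_descent(root, visual):
    """Walk root -> leaf, keeping the best-matching child at each level.

    Returns the chosen leaf label and the number of similarity
    comparisons performed along the way.
    """
    node, comparisons = root, 0
    while node.children:
        comparisons += len(node.children)
        node = max(node.children, key=lambda c: cosine(c.embedding, visual))
    return node.label, comparisons


def leaves(node):
    """Collect all leaf nodes below (and including) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def flat_search(root, visual):
    """Baseline: compare the visual feature against every leaf."""
    candidates = leaves(root)
    best = max(candidates, key=lambda n: cosine(n.embedding, visual))
    return best.label, len(candidates)


def one_hot(i, dim=4, scale=1.0):
    return [scale if k == i else 0.0 for k in range(dim)]


def build_toy_tree():
    """Assumed toy data: 4 coarse groups, each with 4 sub-action leaves."""
    root = Node("root", [0.0] * 4)
    for g in range(4):
        group = Node(f"g{g}", one_hot(g))
        for s in range(4):
            emb = [a + b for a, b in zip(one_hot(g), one_hot(s, scale=0.3))]
            group.children.append(Node(f"g{g}_s{s}", emb))
        root.children.append(group)
    return root


root = build_toy_tree()
visual = [0.0, 0.3, 1.0, 0.0]  # stand-in for a visual feature vector

tree_result = greedy_descent(root, visual)
flat_result = flat_search(root, visual)
print(tree_result, flat_result)
```

In this toy setup both searches select the same leaf, but the descent scores 4 + 4 candidates while the flat search scores all 16. Greedy descent can miss the best leaf when a coarse-level match is misleading, so this illustrates the complexity argument only under the stated assumptions.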

Code Implementations: 1 repo
