CL, CV, Nov 25, 2024

SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

arXiv:2411.16765v3, 15 citations, h-index: 56, ACL
Originality: Highly original
AI Analysis

SHuBERT addresses the need for better pre-training methods in sign language processing. Its approach leverages unlabeled data and models relationships across time, though it is specific to the sign language domain.

The paper tackles the problem of limited transfer learning in sign language processing by introducing SHuBERT, a self-supervised contextual representation model trained on approximately 1,000 hours of American Sign Language video. It achieves state-of-the-art performance on sign language translation, isolated sign language recognition, and fingerspelling detection.

Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. Pre-training methods for sign language have typically focused on either supervised pre-training, which cannot take advantage of unlabeled data, or context-independent (frame or video segment) representations, which ignore the effects of relationships across time in sign language. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
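The multi-stream cluster prediction objective described above can be illustrated with a small sketch. This is not the authors' implementation: the stream names (`hand`, `face`, `body`), the function names, the equal weighting of streams, and the use of plain cross-entropy over masked frames are all assumptions made for illustration, since the abstract does not specify the exact loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_stream_masked_loss(logits_by_stream, targets_by_stream, masked):
    """Cross-entropy on masked frames only, averaged across streams.

    logits_by_stream:  {stream: [frame][cluster] scores} (hypothetical layout)
    targets_by_stream: {stream: [frame] cluster id}, e.g. from clustering
    masked:            [frame] bool, True where the input frame was masked
    """
    masked_frames = [t for t, is_masked in enumerate(masked) if is_masked]
    if not masked_frames:
        return 0.0
    total = 0.0
    for stream, logits in logits_by_stream.items():
        targets = targets_by_stream[stream]
        stream_loss = 0.0
        for t in masked_frames:
            # Negative log-probability of the target cluster at frame t.
            stream_loss -= log_softmax(logits[t])[targets[t]]
        total += stream_loss / len(masked_frames)
    # Assumption: every stream contributes equally to the objective.
    return total / len(logits_by_stream)


# Toy input: 3 frames, 2 clusters per stream, uninformative (uniform) logits.
streams = ("hand", "face", "body")
logits = {s: [[0.0, 0.0] for _ in range(3)] for s in streams}
targets = {s: [0, 1, 0] for s in streams}
masked = [True, False, True]

loss = multi_stream_masked_loss(logits, targets, masked)
```

With uniform logits over two clusters, each masked frame contributes a loss of ln 2 per stream, so the toy value equals ln 2. Unmasked frames do not contribute, which is the property that makes this a masked prediction objective.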
