CV | Jul 31, 2025

Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

arXiv:2507.23575v2 | 2 citations | h-index: 4
Originality: Highly original
AI Analysis

The paper addresses sign language translation without gloss annotations, a task that serves deaf and hard-of-hearing communities, and proposes a novel method for this known bottleneck.

The paper tackles the challenge of gloss-free sign language translation by introducing BeyondGloss, a framework that uses VideoLLMs with hand-centric descriptions and alignment modules, achieving state-of-the-art performance on Phoenix14T and CSL-Daily benchmarks.

Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
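
The abstract does not give the exact form of its contrastive objectives, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss (as popularized by CLIP-style training) between a batch of video features and the features of their paired texts. The function names, the temperature value of 0.07, and the use of plain Python lists are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, target):
    """Log-probability of index `target` under a softmax over `logits` (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch: row i of each input is assumed to be a matched pair.

    Hypothetical helper, not the paper's implementation. The temperature is an
    assumed default.
    """
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    n = len(v)

    # sim[i][j] = cosine similarity of video i and text j, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: each video should pick out its own text among all texts.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each text should pick out its own video among all videos.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    return 0.5 * (v2t + t2v)


# Toy batch: two orthogonal "video" features with matched and swapped "text" features.
video = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(video, matched_text)
loss_swapped = symmetric_contrastive_loss(video, swapped_text)
```

In this toy batch the loss is close to zero when each video lines up with its own text and much larger when the pairs are swapped, which is the behaviour a contrastive alignment objective relies on. In the setting the abstract describes, the text side would be either the generated hand-motion descriptions or the target-language embeddings.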
