CV | Jul 14, 2025

Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation

arXiv:2507.10306v1 | h-index: 2 | IVA
Originality: Incremental advance
AI Analysis

This work addresses the challenge of costly gloss annotations in sign language translation for deaf and hard-of-hearing communities, representing an incremental improvement over existing gloss-free approaches.

The paper tackles the problem of gloss-free sign language translation by proposing a dual visual encoder framework with contrastive pretraining, achieving the highest BLEU-4 score on the Phoenix-2014T benchmark among gloss-free methods.

Sign Language Translation (SLT) aims to convert sign language videos into spoken or written text. While early systems relied on gloss annotations as intermediate supervision, such annotations are costly to obtain and often fail to capture the full complexity of continuous signing. In this work, we propose a two-phase, dual visual encoder framework for gloss-free SLT that leverages contrastive visual-language pretraining. During pretraining, our approach employs two complementary visual backbones whose outputs are jointly aligned with each other and with sentence-level text embeddings via a contrastive objective. During the downstream SLT task, we fuse the visual features and feed them into an encoder-decoder model. On the Phoenix-2014T benchmark, our dual-encoder architecture consistently outperforms its single-stream variants and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
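The pretraining objective described above can be illustrated with a minimal sketch. The abstract states only that the two visual streams are aligned with each other and with sentence-level text embeddings through a contrastive objective; the symmetric InfoNCE form, the temperature value, the equal weighting of the three terms, and every function name below are assumptions made for illustration, not the authors' implementation.

```python
import math

def _normalize(v):
    # L2-normalize a vector so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th item in one modality matches the i-th item in the other.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of paired embeddings.
    # The temperature value is an assumed default, not taken from the paper.
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

def dual_encoder_loss(vis_a, vis_b, text, temperature=0.07):
    # Hypothetical joint objective: each visual stream is aligned with the
    # text, and the two streams are aligned with each other. Equal weighting
    # of the three terms is an assumption.
    return (info_nce(vis_a, text, temperature)
            + info_nce(vis_b, text, temperature)
            + info_nce(vis_a, vis_b, temperature))
```

Here `vis_a` and `vis_b` stand for pooled sentence-level outputs of the two visual backbones and `text` for the sentence embeddings, each a list of equal-length vectors in which index `i` refers to the same video-sentence pair. The loss is small when matching pairs are the most similar entries in the batch and grows when they are not.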

