CV · CL · Nov 25, 2024

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

arXiv:2411.16789v2 · 7 citations · h-index: 3 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of translating sign language into spoken language for accessibility, but the advance is incremental because it builds on existing MLLM capabilities.

The paper tackles sign language translation by proposing a gloss-free framework that uses multimodal large language models to generate textual descriptions of sign components and align them with video features, achieving state-of-the-art performance on PHOENIX14T and CSL-Daily datasets.

Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT. Code is available at https://github.com/hwjeon98/MMSLT.
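The alignment step described in the abstract can be illustrated with a small sketch. The abstract does not specify how description features and video features are fused, or which objective aligns them with the spoken sentence space, so everything below is an assumption made for illustration: the helper names (`mean_pool`, `fuse`, `alignment_loss`), the averaging fusion, and the contrastive (InfoNCE-style) loss are hypothetical stand-ins, not the MMSLT implementation.

```python
import math

def mean_pool(frames):
    """Average a list of per-frame feature vectors into a single vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def fuse(video_vec, desc_vec):
    """Hypothetical fusion: element-wise average of video and description features."""
    return [(v + d) / 2 for v, d in zip(video_vec, desc_vec)]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(fused_batch, sentence_batch, temperature=0.1):
    """Assumed contrastive objective: the i-th fused sign vector should be
    closer to the i-th sentence vector than to any other sentence in the batch."""
    total = 0.0
    for i, fused in enumerate(fused_batch):
        logits = [cosine(fused, s) / temperature for s in sentence_batch]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(fused_batch)
```

In this sketch the loss is small when each fused sign representation points in the same direction as its paired sentence embedding, and grows when the pairs are shuffled. For the actual fusion and pre-training objective, the linked repository is the authoritative source.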

