CL | Dec 15, 2023

VK-G2T: Vision and Context Knowledge enhanced Gloss2Text

arXiv:2312.10210v1 | 1 citation | h-index: 35 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of translating sign language gloss sequences into spoken-language sentences. The advance is incremental: it optimizes one stage (Gloss2Text) of an existing two-stage pipeline.

The paper improves the Gloss2Text stage of sign language translation by addressing two difficulties: isolated gloss input and a low-capacity gloss vocabulary. The resulting model outperforms existing methods on a Chinese benchmark.

Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e., Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e., Gloss2Text). While previous studies have focused on boosting the performance of the Sign2Gloss stage, we emphasize the optimization of the Gloss2Text stage. However, this task is non-trivial due to two distinct features of Gloss2Text: (1) isolated gloss input and (2) low-capacity gloss vocabulary. To address these issues, we propose a vision and context knowledge enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of the sign language video to learn the properties of the target sentence and exploits the context knowledge to facilitate the adaptive translation of gloss words. Extensive experiments conducted on a Chinese benchmark validate the superiority of our model.

