CV · HC · Aug 23, 2021

ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos

arXiv:2108.10059v1 · 10 citations
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of labeled data for sign language recognition by enabling recognition of sign classes without manually annotated examples. It is incremental, however, because it builds on existing zero-shot and transformer methods.

The paper tackles the annotation bottleneck in Sign Language Recognition (SLR) by proposing a zero-shot SLR model using RGB-D videos, achieving state-of-the-art results on four datasets.

Sign Language Recognition (SLR) is a challenging research area in computer vision. To tackle the annotation bottleneck in SLR, we formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model that takes two input modalities: RGB and Depth videos. To benefit from the capabilities of vision Transformers, we use two vision Transformer models, one for human detection and one for visual feature representation. We configure a Transformer encoder-decoder architecture as a fast and accurate human detection model to overcome the challenges of current human detection models. Using human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network. A semantic space maps the visual features to the language embeddings of the class labels, obtained via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on four datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, and NTU-60, obtaining state-of-the-art results relative to prior ZS-SLR models.
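
The zero-shot step described in the abstract, mapping visual features into a semantic space shared with class-label embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projection `W`, the cosine-similarity decision rule, the function names, and the toy 2-D label vectors (standing in for BERT embeddings) are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def project(W, x):
    """Apply a linear map W (list of rows) to a feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def zero_shot_predict(visual_feature, W, label_embeddings):
    """Return the unseen class whose label embedding is closest to the
    projected visual feature (nearest neighbour by cosine similarity)."""
    z = project(W, visual_feature)
    return max(label_embeddings, key=lambda name: cosine(z, label_embeddings[name]))

# Hypothetical toy data: a 3-D visual feature mapped to a 2-D semantic space.
# In the paper's setting the label vectors would come from a BERT model and
# the visual feature from the ViT + LSTM streams; here both are made up.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
unseen_labels = {"hello": [1.0, 0.0], "thanks": [0.0, 1.0]}

prediction = zero_shot_predict([0.9, 0.1, 0.5], W, unseen_labels)
print(prediction)
```

In this sketch only the label embeddings of the unseen classes are needed at test time, which is what lets a zero-shot model classify signs for which no annotated videos were available during training.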

