CV · AI · Jul 22, 2024

SLVideo: A Sign Language Video Moment Retrieval Framework

arXiv:2407.15668v2 · h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in sign language video retrieval for users who need to search and annotate sign language content. It is incremental, however, because it builds on existing methods such as CLIP.

The authors tackle the retrieval of specific sign language video segments from text queries. Their framework, SLVideo, incorporates both facial expressions and hand signs, and they report promising initial results in a zero-shot setting on an eight-hour annotated Portuguese Sign Language dataset.

SLVideo is a video moment retrieval system for sign language videos that incorporates facial expressions, addressing a gap in existing technology. The system extracts embedding representations of the hand and face signs from video frames to capture the signs in their entirety, enabling users to search for a specific sign language video segment with text queries. The dataset is a collection of eight hours of annotated Portuguese Sign Language videos, and a CLIP model generates the embeddings. The initial results are promising in a zero-shot setting. In addition, SLVideo incorporates a thesaurus that lets users search for signs similar to those retrieved, using the video segment embeddings, and it supports the creation and editing of sign language video annotations. Project web page: https://novasearch.github.io/SLVideo/
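The retrieval and thesaurus steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how frame embeddings are pooled or which similarity measure is used, so mean pooling and cosine similarity are assumptions here, and the function names (`segment_embedding`, `rank_segments`, `similar_segments`) are hypothetical. In a real system the vectors would come from CLIP's text and image encoders; short toy lists stand in for them below.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def segment_embedding(frame_embeddings):
    """Pool per-frame embeddings into one unit-length segment vector.

    Mean pooling is an assumption made for this sketch.
    """
    unit = [normalize(f) for f in frame_embeddings]
    dim = len(unit[0])
    mean = [sum(f[d] for f in unit) / len(unit) for d in range(dim)]
    return normalize(mean)


def rank_segments(query_embedding, segment_embeddings, top_k=3):
    """Text-to-video retrieval: rank segments by similarity to the query."""
    scored = [
        (i, cosine(query_embedding, seg))
        for i, seg in enumerate(segment_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def similar_segments(index, segment_embeddings, top_k=3):
    """Thesaurus-style lookup: segments closest to an already retrieved one."""
    ranked = rank_segments(
        segment_embeddings[index],
        segment_embeddings,
        top_k=len(segment_embeddings),
    )
    return [(i, score) for i, score in ranked if i != index][:top_k]


if __name__ == "__main__":
    # Toy 3-D vectors standing in for CLIP embeddings of three segments.
    segments = [
        segment_embedding([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
        segment_embedding([[0.0, 1.0, 0.0]]),
        segment_embedding([[0.8, 0.2, 0.0]]),
    ]
    query = [1.0, 0.0, 0.0]  # stand-in for an encoded text query
    print(rank_segments(query, segments, top_k=2))
    print(similar_segments(0, segments, top_k=1))
```

Because text and video segments share one embedding space, the same ranking routine serves both the text search and the similar-sign lookup; only the query vector changes.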

