CV | CL | Aug 28, 2019

Fingerspelling recognition in the wild with iterative visual attention

arXiv:1908.10546v1 | 76 citations
AI Analysis

This paper addresses sign language recognition in real-life settings, with accessibility applications in mind. The contribution is incremental in that it builds on existing attention mechanisms.

The paper tackles fingerspelling recognition in American Sign Language videos from uncontrolled real-world sources such as YouTube. It proposes an end-to-end model with iterative visual attention that outperforms prior work by a large margin, and introduces a new crowdsourced dataset that further improves performance.

Sign language recognition is a challenging gesture sequence recognition problem, characterized by quick and highly coarticulated motion. In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media. Most previous work on sign language recognition has focused on controlled settings where the data is recorded in a studio environment and the number of signers is limited. Our work aims to address the challenges of real-life data, reducing the need for detection or segmentation modules commonly used in this domain. We propose an end-to-end model based on an iterative attention mechanism, without explicit hand detection or segmentation. Our approach dynamically focuses on increasingly high-resolution regions of interest. It outperforms prior work by a large margin. We also introduce a newly collected data set of crowdsourced annotations of fingerspelling in the wild, and show that performance can be further improved with this additional data set.
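The idea of focusing on increasingly high-resolution regions of interest can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `zoom_box` and `iterative_zoom`, the `shrink` factor, and the use of an attention-weighted centre of mass to place the next crop are all assumptions made for illustration. The attention map is assumed to come from some model (here a stand-in callable) and to have a non-zero sum.

```python
import numpy as np

def zoom_box(attention, box, shrink=0.5):
    """Shrink `box` around the attention-weighted centre of mass.

    attention: 2-D non-negative map covering the current box (assumed non-zero sum).
    box: (top, left, height, width) in full-resolution pixel coordinates.
    Hypothetical helper for illustration only.
    """
    top, left, height, width = box
    rows, cols = attention.shape
    weights = attention / attention.sum()
    # Attention-weighted centre as a fraction of the current box, using cell centres.
    cy = ((np.arange(rows) + 0.5) / rows * weights.sum(axis=1)).sum()
    cx = ((np.arange(cols) + 0.5) / cols * weights.sum(axis=0)).sum()
    new_h, new_w = height * shrink, width * shrink
    # Centre the smaller box on the attended point, clamped to stay inside the old box.
    new_top = np.clip(top + cy * height - new_h / 2, top, top + height - new_h)
    new_left = np.clip(left + cx * width - new_w / 2, left, left + width - new_w)
    return (float(new_top), float(new_left), float(new_h), float(new_w))

def iterative_zoom(frame_shape, attention_fn, steps=3, shrink=0.5):
    """Return the sequence of boxes visited, starting from the full frame.

    attention_fn(box) stands in for a model that returns an attention map
    over the current box; it is an assumption, not part of the paper's API.
    """
    box = (0.0, 0.0, float(frame_shape[0]), float(frame_shape[1]))
    boxes = [box]
    for _ in range(steps):
        box = zoom_box(attention_fn(box), box, shrink)
        boxes.append(box)
    return boxes
```

Because each box is expressed in full-resolution coordinates, a crop taken from the original frame at a later step covers a smaller area and can be resized to the same input size, so the region of interest is seen at a higher effective resolution.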

Code Implementations: 2 repos