MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language
MS-ASL addresses the scarcity of labeled data for sign language recognition, making it possible to evaluate generalization to unseen signers under more realistic conditions, though its contribution is incremental: a larger data set and improved model performance.
The authors tackled the problem of sign language recognition by creating MS-ASL, a large-scale data set with over 25,000 annotated videos covering 1000 signs, and proposed applying I3D, an architecture known from video classification, which outperformed the previous state of the art by a large margin.
Sign language recognition is a challenging and often underestimated problem comprising multi-modal articulators (handshape, orientation, movement, upper body and face) that integrate asynchronously on multiple streams. Learning powerful statistical models in such a scenario requires much data, particularly to apply recent advances of the field. However, labeled data is a scarce resource for sign language due to the enormous cost of transcribing these unwritten languages. We propose the first real-life large-scale sign language data set comprising over 25,000 annotated videos, which we thoroughly evaluate with state-of-the-art methods from sign and related action recognition. Unlike the current state of the art, the data set makes it possible to investigate generalization to unseen individuals (signer-independent test) in a realistic setting with over 200 signers. Previous work mostly deals with limited vocabulary tasks, while here, we cover a large class count of 1000 signs in challenging and unconstrained real-life recording conditions. We further propose I3D, known from video classification, as a powerful and suitable architecture for sign language recognition, outperforming the current state of the art by a large margin. The data set is publicly available to the community.
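To make the signer-independent test mentioned above concrete, the following is a minimal sketch of how such a split can be built: every signer is assigned to exactly one partition, so the test set contains only individuals the model has never seen. This is an illustration only and not the procedure used to create the MS-ASL splits; the function name `signer_independent_split`, the record layout `(signer_id, clip_id, label)`, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def signer_independent_split(samples, test_fraction=0.2, seed=0):
    """Split (signer_id, clip_id, label) records so that no signer
    appears in both the training and the test partition.

    Hypothetical helper for illustration; not the MS-ASL split procedure.
    """
    # Group all records by the signer who produced them.
    by_signer = defaultdict(list)
    for record in samples:
        by_signer[record[0]].append(record)

    # Shuffle signers (not clips) deterministically, then hold out a fraction.
    signers = sorted(by_signer)
    random.Random(seed).shuffle(signers)
    n_test = max(1, round(len(signers) * test_fraction))
    held_out = set(signers[:n_test])

    train = [r for s in signers if s not in held_out for r in by_signer[s]]
    test = [r for s in signers if s in held_out for r in by_signer[s]]
    return train, test


# Toy data: 10 made-up signers with 5 clips each, labels drawn from 3 classes.
samples = [
    (f"signer_{i}", f"clip_{i}_{j}", j % 3)
    for i in range(10)
    for j in range(5)
]

train, test = signer_independent_split(samples)
train_signers = {r[0] for r in train}
test_signers = {r[0] for r in test}
```

Splitting at the clip level instead would let the same person appear on both sides, which tends to overstate how well a model generalizes to new signers.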