Watch, read and lookup: learning to spot signs from multiple supervisors
This work addresses sign spotting, a building block for sign language recognition in continuous video. Its contribution is incremental: it builds on existing methods and adds new sources of supervision.
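The paper tackles sign spotting: given a video of an isolated sign, determine whether and where that sign occurs in a continuous, co-articulated sign language video. The model is trained with three sources of supervision: sparsely labelled footage, subtitles that provide weak supervision, and visual sign language dictionaries that enable spotting of novel signs. These tasks are integrated into a single framework via Noise Contrastive Estimation and Multiple Instance Learning, and the approach is validated on low-shot sign spotting benchmarks. The paper also contributes BSLDict, a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs.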
The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available translations of the signed content), which provide additional weak supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page.
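
To make the combination of Noise Contrastive Estimation and Multiple Instance Learning concrete, the following is a minimal sketch of a generic MIL-NCE style objective together with a simple spotting step. It is not the authors' implementation. The function names (`mil_nce_loss`, `spot_sign`), the use of cosine similarity, the temperature of 0.07 and the threshold of 0.5 are all assumptions made for illustration. The "bag" of positives stands for candidate temporal windows from a continuous video whose subtitle mentions the query word, of which only some (possibly one) actually contain the sign.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_nce_loss(query, positive_bag, negatives, temperature=0.07):
    """Generic MIL-NCE style loss (an illustrative assumption, not the paper's exact loss).

    query:        embedding of the isolated sign (e.g. from a dictionary video)
    positive_bag: embeddings of candidate windows, at least one of which
                  is assumed to contain the sign (the MIL assumption)
    negatives:    embeddings of windows assumed not to contain the sign
    """
    pos = [cosine(query, p) / temperature for p in positive_bag]
    neg = [cosine(query, n) / temperature for n in negatives]
    # -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) )
    return logsumexp(pos + neg) - logsumexp(pos)


def spot_sign(query, window_embeddings, threshold=0.5):
    """Return (found, best_index, best_score) for a query over sliding windows."""
    sims = [cosine(query, w) for w in window_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best] >= threshold, best, sims[best]


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    query = [1.0, 0.0]
    negatives = [[0.0, 1.0], [-1.0, 0.2]]
    bag_with_match = [[0.9, 0.1], [0.0, 1.0]]
    bag_without_match = [[0.0, 1.0], [-1.0, 0.0]]
    print(mil_nce_loss(query, bag_with_match, negatives))
    print(mil_nce_loss(query, bag_without_match, negatives))
    print(spot_sign(query, [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]))
```

Summing over the positive bag inside the logarithm means the loss is low as soon as any one candidate window matches the query, which is how weak subtitle supervision can be used without knowing the exact location of the sign. At test time, the "whether" question reduces to thresholding the best similarity and the "where" question to the index of the best window.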