Fingerspelling PoseNet: Enhancing Fingerspelling Translation with Pose-Based Transformer Models
This paper addresses fingerspelling recognition for sign language translation, offering incremental improvements over existing methods.
The paper tackles American Sign Language fingerspelling translation from videos by proposing a transformer-based architecture with a novel loss term for word length prediction and a two-stage inference approach, achieving over 10% relative improvement on ChicagoFSWild and ChicagoFSWild+ benchmarks.
We address the task of American Sign Language fingerspelling translation from videos in the wild. We exploit advances in hand pose estimation accuracy and propose a novel architecture built on a transformer-based encoder-decoder model, enabling seamless contextual word translation. The translation model is augmented by a novel loss term for accurately predicting the length of the fingerspelled word, which benefits both training and inference. We also propose a novel two-stage inference approach that re-ranks the hypotheses using the language model capabilities of the decoder. Through extensive experiments, we demonstrate that our proposed method outperforms the state-of-the-art models on ChicagoFSWild and ChicagoFSWild+, achieving more than 10% relative improvement in performance. Our findings highlight the effectiveness of our approach and its potential to advance fingerspelling recognition in sign language translation. Code is available at https://github.com/pooyafayyaz/Fingerspelling-PoseNet.
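To make the two-stage inference idea concrete, the following is a minimal sketch of hypothesis re-ranking. It is not taken from the released code. The names `Hypothesis`, `rerank`, and `toy_lm_score`, the weights `lm_weight` and `length_weight`, and the way the predicted word length enters the score are all assumptions made for illustration. The sketch takes candidates from a first decoding pass, adds a second score standing in for the decoder's language-model score, and penalizes candidates whose length differs from the predicted length.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str           # candidate fingerspelled word
    first_stage: float  # log-score from the first decoding pass


def rerank(
    hypotheses: List[Hypothesis],
    lm_score: Callable[[str], float],
    predicted_length: float,
    lm_weight: float = 0.5,      # assumed interpolation weight
    length_weight: float = 0.1,  # assumed penalty per character of mismatch
) -> List[Hypothesis]:
    """Return the candidates sorted by a combined score, best first."""

    def combined(h: Hypothesis) -> float:
        # Distance between the candidate length and the predicted length.
        length_penalty = abs(len(h.text) - predicted_length)
        return (
            h.first_stage
            + lm_weight * lm_score(h.text)
            - length_weight * length_penalty
        )

    return sorted(hypotheses, key=combined, reverse=True)


def toy_lm_score(text: str) -> float:
    # Hypothetical stand-in for the decoder's language-model log-score.
    known = {"chicago": -1.0}
    return known.get(text, -5.0)


candidates = [
    Hypothesis("chicaqo", -2.0),
    Hypothesis("chicago", -2.2),
    Hypothesis("chicag", -2.1),
]
ranked = rerank(candidates, toy_lm_score, predicted_length=7.0)
best = ranked[0]
print(best.text)
```

In this toy example the first pass slightly prefers the misspelled candidate, and the second-stage score moves the correctly spelled word to the top. In a real system the stand-in scoring function would be replaced by the trained decoder, and the weights would be tuned on held-out data.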