Keypoint based Sign Language Translation without Glosses
This work addresses the challenge of making sign language accessible to non-disabled people by translating it directly into spoken language. The contribution appears incremental, since it builds on existing translation models and adds specific enhancements.
The paper tackles the problem of translating sign language videos directly into spoken language without using glosses. It proposes a keypoint normalization method and a stochastic frame selection method, both of which improve performance in quantitative experiments.
Sign Language Translation (SLT) has been studied far less than Sign Language Recognition (SLR). However, SLR only recognizes signs within the unique grammar of sign language, which differs from that of spoken language, so its output is difficult for non-disabled people to interpret. We therefore address the problem of translating sign language video directly into spoken language. To this end, we propose a new keypoint normalization method that performs translation based on the skeleton points of the signer and normalizes these points robustly. Customizing the normalization to each body part contributes to the performance improvement. In addition, we propose a stochastic frame selection method that performs frame augmentation and sampling at the same time. Finally, the normalized keypoints are translated into spoken language by an attention-based translation model. Because our method does not rely on glosses, it can be applied to a wide range of datasets, including those without gloss annotations. Quantitative experiments demonstrate the effectiveness of our method.
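The abstract does not give the details of the two proposed preprocessing steps, so the following is only a minimal sketch of how they might look, not the method itself. It assumes that stochastic frame selection splits the video into equal segments and draws one random frame from each, and that part-wise normalization centers and scales each body part separately. The function names, the part grouping, and the choice of mean and standard deviation as normalization statistics are all hypothetical.

```python
import numpy as np


def stochastic_frame_selection(num_frames, num_samples, rng):
    """Pick one random frame index from each of `num_samples` contiguous segments.

    Assumed scheme: fixing the number of output frames acts as sampling, and
    redrawing the indices on every call acts as augmentation.
    """
    edges = np.linspace(0, num_frames, num_samples + 1)
    lo = np.floor(edges[:-1]).astype(int)
    # Guarantee at least one candidate frame per segment (short videos repeat frames).
    hi = np.maximum(np.floor(edges[1:]).astype(int), lo + 1)
    return rng.integers(lo, hi)  # the upper bound is exclusive


def normalize_by_part(keypoints, parts, eps=1e-8):
    """Normalize each body part separately.

    keypoints: array of shape (T, K, 2) with 2D keypoints for T frames.
    parts: dict mapping a part name to the list of keypoint indices it owns.
    Assumed scheme: per frame, subtract the part's mean point and divide by
    the part's standard deviation.
    """
    out = np.zeros_like(keypoints, dtype=np.float64)
    for idx in parts.values():
        pts = keypoints[:, idx, :].astype(np.float64)       # (T, P, 2)
        centered = pts - pts.mean(axis=1, keepdims=True)     # center on the part
        scale = centered.std(axis=(1, 2), keepdims=True) + eps
        out[:, idx, :] = centered / scale
    return out


# Usage with synthetic keypoints (hypothetical 6-point skeleton).
rng = np.random.default_rng(0)
video = rng.normal(size=(100, 6, 2)) * 50 + 200
parts = {"body": [0, 1], "left_hand": [2, 3], "right_hand": [4, 5]}

frame_idx = stochastic_frame_selection(len(video), 16, rng)
features = normalize_by_part(video[frame_idx], parts)
```

Under these assumptions, each training epoch sees a different fixed-length subsequence of the same video, and small parts such as the hands are not dominated by the scale of the whole body.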