SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences
This work addresses dataset scarcity in sign language research, though its contribution is incremental because it builds on existing datasets and methods.
The paper tackles the limited coverage and high cost of sign language datasets by developing SkelCap, a model that generates textual descriptions from skeleton keypoint sequences; it achieves a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in signer-agnostic evaluation.
Numerous sign language datasets exist, yet they typically cover only a small fraction of the thousands of signs used globally. Moreover, creating diverse sign language datasets is expensive and challenging because of the cost of recruiting a varied group of signers. Motivated by these limitations, we focused on textually describing body movements from skeleton keypoint sequences, which led to the creation of a new dataset. We built this dataset on AUTSL, a comprehensive isolated Turkish sign language dataset. We also developed a baseline model, SkelCap, which generates textual descriptions of body movements. The model takes the skeleton keypoint data as a vector, applies a fully connected layer for embedding, and uses a transformer neural network for sequence-to-sequence modeling. We evaluated the model extensively, including signer-agnostic and sign-agnostic assessments. It achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. The dataset we prepared, AUTSL-SkelCap, will be made publicly available soon.
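To make the reported ROUGE-L figure easier to interpret, the following is a minimal sketch of how a ROUGE-L F-score can be computed from the longest common subsequence (LCS) of a reference and a generated description. It is not the evaluation code used for SkelCap: the whitespace tokenization, lowercasing, equal weighting of precision and recall, the function names `lcs_length` and `rouge_l_f1`, and the example sentences are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # Dynamic programming with a rolling row: prev[j] holds the LCS length
    # of the tokens of `a` seen so far and the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 between two sentences.

    Assumptions (not taken from the paper): lowercase whitespace
    tokenization and equal weighting of precision and recall.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical descriptions, invented purely for illustration.
    reference = "the right hand moves up"
    candidate = "the left hand moves up"
    print(round(rouge_l_f1(reference, candidate), 2))
```

In this hypothetical pair, four of the five tokens appear in the same order in both sentences, so precision and recall are both 4/5. A score of 0.98 therefore indicates that generated descriptions match the reference descriptions almost token for token, in order.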