Using Motion History Images with 3D Convolutional Networks in Isolated Sign Language Recognition
This work addresses sign language recognition for accessibility applications, but its contribution is incremental: it builds on existing methods, with a focus on using only RGB data.
The paper tackles isolated sign language recognition by proposing a model that combines Motion History Images (MHI) with 3D convolutional networks, achieving performance competitive with state-of-the-art models that use multi-modal data on the AUTSL and BosphorusSign22k datasets.
Sign language recognition using computational models is a challenging problem that requires simultaneous spatio-temporal modeling of multiple sources, e.g. faces, hands, and body. In this paper, we propose an isolated sign language recognition model based on a model trained using Motion History Images (MHI) that are generated from RGB video frames. RGB-MHI images effectively represent a spatio-temporal summary of each sign video in a single RGB image. We propose two different approaches using this RGB-MHI model. In the first approach, we use the RGB-MHI model as a motion-based spatial attention module integrated into a 3D-CNN architecture. In the second approach, we combine the RGB-MHI model features directly with the features of a 3D-CNN model using a late fusion technique. We perform extensive experiments on two recently released large-scale isolated sign language datasets, namely AUTSL and BosphorusSign22k. Our experiments show that our models, which use only RGB data, can compete with the state-of-the-art models in the literature that use multi-modal data.
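To illustrate how a video can be collapsed into a single motion summary image, the following is a minimal sketch of the classic single-channel MHI update rule: a pixel is set to a maximum value tau when motion is detected and decays by one step per frame otherwise. It is not the paper's RGB-MHI procedure, whose details are not given here. The function name motion_history_image, the frame-differencing motion detector, and the threshold and tau parameters are all assumptions made for illustration.

```python
import numpy as np


def motion_history_image(frames, threshold=30.0, tau=None):
    """Compute a classic Motion History Image from a stack of video frames.

    frames: array of shape (T, H, W) or (T, H, W, 3), pixel values in 0..255.
    threshold: assumed absolute-difference level that counts as motion.
    tau: history length in frames; defaults to T - 1 (an assumption).
    Returns an (H, W) float array in [0, 1]; brighter means more recent motion.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim == 4:
        # Collapse color channels with a plain mean (illustrative choice).
        frames = frames.mean(axis=-1)

    num_frames = frames.shape[0]
    if tau is None:
        tau = max(num_frames - 1, 1)

    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, num_frames):
        # Simple frame differencing stands in for a motion detector.
        moving = np.abs(frames[t] - frames[t - 1]) > threshold
        # Moving pixels are reset to tau; all others decay by one step.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))

    return mhi / float(tau)


# Usage: a bright pixel moving one column to the right per frame.
video = np.zeros((4, 8, 8), dtype=np.float32)
for t in range(4):
    video[t, 0, t] = 255.0

mhi = motion_history_image(video)
```

In the resulting image, pixels that moved most recently are brightest and older motion fades toward zero, so the direction and timing of movement are encoded in a single frame. Such an image can then be fed to an ordinary 2D image model.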