You Have a Point There: Object Selection Inside an Automobile Using Gaze, Head Pose and Finger Pointing
This work aims to improve in-car object selection for drivers and passengers by combining multiple input modalities, representing an incremental improvement in human-computer interaction within vehicles.
This paper proposes a multimodal fusion method that uses gaze, head pose, and finger pointing, triggered by speech, to select control modules in a car. The authors show that fusing these inputs improves pointing-direction accuracy compared to single modalities, with deep learning outperforming conventional methods.
Sophisticated user interaction in the automotive industry is a fast-emerging topic. Mid-air gestures and speech already have numerous applications in driver-car interaction, and multimodal approaches are being developed to exploit multiple sensors. In this paper, we propose a fast and practical multimodal fusion method based on machine learning for selecting control modules in a vehicle. The modalities considered are gaze, head pose, and finger pointing gesture; speech is used only as a trigger for fusion. A single modality has often been used to recognize the user's pointing direction. We instead demonstrate how multiple inputs can be fused to enhance recognition performance. Furthermore, we compare different deep neural network architectures against conventional machine learning methods, namely Support Vector Regression and Random Forests, and show that deep learning improves pointing-direction accuracy. The results suggest great potential for multimodal inputs, which could be applied to more use cases in the vehicle.
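To make the idea of speech-triggered fusion concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes each modality delivers timestamped 3D direction vectors, picks the sample closest to the speech trigger, and combines them with fixed hand-chosen weights. The paper instead learns the fusion with machine learning models; the fixed weights, the stream layout, and all names here (`sample_at`, `fuse_directions`, `angular_error_deg`) are hypothetical stand-ins for illustration only.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in v)

def sample_at(stream, t):
    """Return the direction of the (timestamp, direction) sample closest to time t."""
    return min(stream, key=lambda s: abs(s[0] - t))[1]

def fuse_directions(directions, weights):
    """Weighted average of unit direction vectors, renormalized to unit length.

    Assumption: fixed weights stand in for the learned fusion model.
    """
    fused = [0.0, 0.0, 0.0]
    for name, d in directions.items():
        u = normalize(d)
        for i in range(3):
            fused[i] += weights[name] * u[i]
    return normalize(fused)

def angular_error_deg(a, b):
    """Angle in degrees between two direction vectors."""
    a, b = normalize(a), normalize(b)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.degrees(math.acos(dot))

# Hypothetical sensor streams: lists of (timestamp in seconds, direction vector).
streams = {
    "gaze":   [(0.00, (0.10, 0.00, 1.00)), (0.10, (0.30, 0.05, 1.00))],
    "head":   [(0.02, (0.00, 0.00, 1.00)), (0.12, (0.20, 0.00, 1.00))],
    "finger": [(0.01, (0.50, 0.10, 1.00)), (0.11, (0.35, 0.10, 1.00))],
}
# Hypothetical fixed weights; a learned model would replace these.
weights = {"gaze": 0.4, "head": 0.2, "finger": 0.4}

# The speech command only marks *when* to fuse, not what to select.
trigger_time = 0.10

at_trigger = {name: sample_at(s, trigger_time) for name, s in streams.items()}
fused = fuse_directions(at_trigger, weights)
print("fused pointing direction:", fused)
```

A learned regressor, such as the Support Vector Regression, Random Forest, or neural network models the abstract compares, would take the per-modality samples at the trigger as input features in place of the fixed weighted average, and `angular_error_deg` illustrates one way pointing-direction accuracy could be scored against a known target direction.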