Detecting expressions with multimodal transformers
This research improves audio-visual detection of user expressions, with the aim of more natural user experiences on communal devices such as Amazon Alexa.
This study develops deep-learning algorithms for audio-visual detection of user expressions, described in terms of arousal and valence. The proposed transformer architecture achieves absolute gains of approximately 2% on the arousal and valence descriptors over a recurrent baseline, and multimodal models improve on single-modality models by up to 3.6%.
Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of users' expressions. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrate audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies show the significance of the visual modality for expression detection on the Aff-Wild2 database.
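
The idea of letting encoder layers integrate audio and visual features can be illustrated with a minimal sketch. It is not the architecture from the paper: it assumes a single attention head, identity query/key/value projections, no feed-forward or normalization sublayers, and toy two-dimensional features. The names `self_attention` and `fuse` and the feature values are hypothetical, chosen only for illustration. The point it shows is that once audio and visual frames are placed in one token sequence, each output token becomes a weighted mix of both modalities.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplifying assumption: queries, keys, and values are the tokens
    themselves (identity projections); a real encoder layer learns them.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token, audio or visual.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, so modalities are mixed.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def fuse(audio_frames, visual_frames):
    """Concatenate both modalities into one sequence, attend, then mean-pool."""
    tokens = audio_frames + visual_frames
    attended = self_attention(tokens)
    d = len(attended[0])
    return [sum(t[i] for t in attended) / len(attended) for i in range(d)]


# Hypothetical toy features: two audio frames and two visual frames.
audio = [[0.1, 0.3], [0.2, 0.1]]
visual = [[0.5, -0.2], [0.4, 0.0]]
fused = fuse(audio, visual)
print(fused)
```

In a full model, the pooled vector (or the per-frame outputs) would feed regression heads for arousal and valence; that part is omitted here because the abstract does not describe it.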