DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
This work addresses engagement estimation in conversations. Its contribution is incremental: it builds on existing methods with specific architectural improvements.
The paper treats conversational engagement estimation as a regression problem and introduces a dilated convolutional Transformer model that improves over the baseline models by 7% on the test set and 4% on the validation set.
Conversational engagement estimation is posed as a regression problem that entails identifying the favorable attention and involvement of the participants in a conversation. The task is crucial for gaining insight into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, with a noteworthy $7$\% improvement on the test set and $4$\% on the validation set. Moreover, we evaluate different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
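To illustrate the dilated convolution idea named in the title, the following is a minimal pure-Python sketch of a 1D dilated convolution over a feature sequence, together with a helper that computes the receptive field of a stack of such layers. It is not the authors' implementation: the function names `dilated_conv1d` and `receptive_field`, the kernel values, and the dilation rates are hypothetical choices for illustration, since the abstract does not specify them.

```python
from typing import List


def dilated_conv1d(signal: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Valid (no padding) 1D dilated cross-correlation over a sequence.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    # Number of time steps spanned by one application of the kernel.
    span = (len(kernel) - 1) * dilation + 1
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `dilation` steps apart along the sequence.
        out.append(sum(w * signal[start + k * dilation] for k, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed toy values: a 2-tap summing kernel with dilation 2.
frames = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_conv1d(frames, [1.0, 1.0], dilation=2))

# Assumed stack: kernel size 3 with dilation rates 1, 2, 4.
print(receptive_field(3, [1, 2, 4]))
```

With a fixed kernel size, increasing the dilation rate lets each layer cover a longer stretch of the conversation without adding parameters. That is a common reason for pairing dilated convolutions with a sequence model such as a Transformer.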