DA-Mamba: Dialogue-aware selective state-space model for multimodal engagement estimation
This work addresses engagement estimation for applications like adaptive tutoring and remote healthcare, offering incremental improvements in efficiency and performance for resource-constrained settings.
The paper tackles the problem of estimating human engagement in conversational scenarios by introducing DA-Mamba, a dialogue-aware multimodal architecture that uses Mamba-based selective state-space processing to achieve linear time and memory complexity. DA-Mamba surpasses prior state-of-the-art methods in concordance correlation coefficient on three benchmarks while reducing training time and memory usage.
Human engagement estimation in conversational scenarios is essential for applications such as adaptive tutoring, remote healthcare assessment, and socially aware human-computer interaction. Engagement is a dynamic, multimodal signal conveyed by facial expressions, speech, gestures, and behavioral cues over time. In this work we introduce DA-Mamba, a dialogue-aware multimodal architecture that replaces attention-heavy dialogue encoders with Mamba-based selective state-space processing to achieve linear time and memory complexity while retaining expressive cross-modal reasoning. We design a Mamba dialogue-aware selective state-space model composed of three core modules: a Dialogue-Aware Encoder and two Mamba-based fusion mechanisms, Modality-Group Fusion and Partner-Group Fusion. Together, these modules provide expressive dialogue understanding. Extensive experiments on three standard benchmarks (NoXi, NoXi-Add, and MPIIGI) show that DA-Mamba surpasses prior state-of-the-art (SOTA) methods in concordance correlation coefficient (CCC) while reducing training time and peak memory; these gains enable processing much longer sequences and facilitate real-time deployment in resource-constrained, multi-party conversational settings. The source code will be available at: https://github.com/kksssssss-ssda/MMEA.
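For readers unfamiliar with the evaluation metric, the concordance correlation coefficient can be sketched as follows. This is a minimal illustration of the standard definition, CCC = 2 cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))^2), and not the authors' evaluation code. The function name `concordance_cc` and the choice to score one flat sequence are assumptions made for illustration; how scores are aggregated across sequences or participants is not specified in this abstract.

```python
import numpy as np


def concordance_cc(pred, target):
    """Concordance correlation coefficient between two 1-D sequences.

    Hypothetical helper for illustration. Uses population (biased)
    variance and covariance. The result is undefined (division by zero)
    when both inputs are constant and have the same mean.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()

    # Unlike Pearson correlation, the squared mean difference in the
    # denominator penalizes systematic offsets between the sequences.
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


if __name__ == "__main__":
    annotated = [0.1, 0.4, 0.5, 0.9]
    predicted = [0.2, 0.5, 0.6, 1.0]  # same shape, constant offset of 0.1
    print(concordance_cc(predicted, annotated))
```

A prediction that tracks the annotated engagement perfectly but with a constant offset has a Pearson correlation of 1 and a CCC below 1, which is why CCC is a stricter measure of agreement for continuous engagement scores.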