AI · May 13

Multimodal Hidden Markov Models for Persistent Emotional State Tracking

arXiv:2605.12838 · 33.7
Predicted impact: top 86% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in affective computing and clinical conversational AI, this provides an interpretable and efficient method for tracking persistent emotional states, though it is an incremental application of existing HMM variants.

The paper proposes a lightweight framework that uses sticky factorial HDP-HMMs to model conversational emotion as latent regimes over multimodal valence-arousal trajectories. It reports more interpretable regime sequences than a baseline Gaussian HMM, lower computational cost than LLM-based dialogue state tracking, and improved LLM response quality in clinical settings when the recovered regimes are used for context augmentation.

Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.
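To make the "sticky" idea concrete, here is a minimal sketch, not the paper's method: a finite Gaussian HMM with fixed parameters, decoded with Viterbi, in which an extra self-transition mass `kappa` discourages rapid regime switching. The paper's factorial HDP-HMM infers the number of regimes and their parameters; everything below (the two regime means, the variance, the `kappa` values, the toy trajectory, and all function names) is an assumption chosen only for illustration.

```python
import math

def sticky_transitions(n_states, kappa):
    """Row-stochastic matrix: unit base mass per state plus extra mass kappa on the diagonal."""
    total = n_states + kappa
    return [[(1.0 + (kappa if i == j else 0.0)) / total for j in range(n_states)]
            for i in range(n_states)]

def log_gauss(point, mean, var):
    """Log density of an isotropic 2-D Gaussian at a (valence, arousal) point."""
    sq = sum((p - m) ** 2 for p, m in zip(point, mean))
    return -math.log(2 * math.pi * var) - sq / (2 * var)

def viterbi(obs, means, var, trans):
    """Most likely regime sequence under a uniform initial distribution."""
    n = len(means)
    log_t = [[math.log(p) for p in row] for row in trans]
    score = [math.log(1.0 / n) + log_gauss(obs[0], means[k], var) for k in range(n)]
    back = []
    for x in obs[1:]:
        prev = score
        # Best predecessor for each destination state j.
        ptr = [max(range(n), key=lambda i: prev[i] + log_t[i][j]) for j in range(n)]
        score = [prev[ptr[j]] + log_t[ptr[j]][j] + log_gauss(x, means[j], var)
                 for j in range(n)]
        back.append(ptr)
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical regimes: 0 = negative valence, 1 = positive valence.
means = [(-0.5, 0.0), (0.5, 0.0)]
# Toy (valence, arousal) trajectory with one noisy blip at index 2.
obs = [(-0.6, 0.1), (-0.4, 0.0), (0.1, 0.1), (-0.5, -0.1),
       (-0.4, 0.0), (0.5, 0.1), (0.6, 0.0), (0.4, -0.1)]

plain = viterbi(obs, means, 0.25, sticky_transitions(2, 0.0))
sticky = viterbi(obs, means, 0.25, sticky_transitions(2, 8.0))
print(plain)   # [0, 0, 1, 0, 0, 1, 1, 1]
print(sticky)  # [0, 0, 0, 0, 0, 1, 1, 1]
```

With `kappa = 0` the decoder follows each observation independently and flips regime on the single noisy point; with `kappa = 8` the self-transition probability rises to 0.9, so the blip is absorbed and only the sustained shift is reported as a regime change. This is the sense in which a sticky prior yields persistent phases rather than utterance-level labels.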
