CVSep 10, 2024Code
MIP-GAF: A MLLM-annotated Benchmark for Most Important Person Localization and Group Context UnderstandingSurbhi Madan, Shreya Ghosh, Lownish Rai Sookha et al.
Estimating the Most Important Person (MIP) in any social event setup is a challenging problem mainly due to contextual complexity and scarcity of labeled data. Moreover, the causality aspects of MIP estimation are quite subjective and diverse. To this end, we aim to address the problem by annotating a large-scale `in-the-wild' dataset for identifying human perceptions about the `Most Important Person (MIP)' in an image. The paper provides a thorough description of our proposed Multimodal Large Language Model (MLLM) based data annotation strategy, and a thorough data quality analysis. Further, we perform a comprehensive benchmarking of the proposed dataset utilizing state-of-the-art MIP localization methods, indicating a significant drop in performance compared to existing datasets. The performance drop shows that the existing MIP localization algorithms must be more robust with respect to `in-the-wild' situations. We believe the proposed dataset will play a vital role in building the next-generation social situation understanding methods. The code and data is available at https://github.com/surbhimadan92/MIP-GAF.
CVDec 9, 2025
A Survey of Body and Face Motion: Datasets, Performance Evaluation Metrics and Generative TechniquesLownish Rai Sookha, Nikhil Pakhale, Mudasir Ganaie et al.
Body and face motion play an integral role in communication. They convey crucial information on the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representations techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on https://lownish23csz0010.github.io/mogen/.
LGMar 16, 2025Code
MAVEN: Multi-modal Attention for Valence-Arousal Emotion NetworkVrushank Ahire, Kunal Shah, Mudasir Nazir Khan et al.
Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal and often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, predicting emotions in polar coordinates following Russell's circumplex model. The evaluation of the Aff-Wild2 dataset using MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline model with a CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW