CV CL LGApr 21, 2024

Listen Then See: Video Alignment with Speaker Attention

Aviral Agrawal, Carlos Mateo Samudio Lezcano, Iqui Balam Heredia-Marin, Prabhdeep Singh Sethi

arXiv:2404.13530v13 citationsh-index: 5Has Code2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Originality Incremental advance

AI Analysis

This work improves video-based question answering for socially intelligent AI systems by better integrating multimodal information, though it is incremental in nature.

The paper tackles the challenge of Socially Intelligent Question Answering (SIQA) by addressing language overfitting and video modality bypassing, achieving state-of-the-art results with 82.06% accuracy on the Social IQ 2.0 dataset.

Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, but in addition, it requires processing nuanced human behavior. Furthermore, the complexities involved are exacerbated by the dominance of the primary modality (text) over the others. Thus, there is a need to help the task's secondary modalities to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06\% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality. This leads to enhanced performance by reducing the prevalent issue of language overfitting and resultant video modality bypassing encountered by current existing techniques. Our code and models are publicly available at https://github.com/sts-vlcc/sts-vlcc

View on arXiv PDF Code

Similar