CL | Oct 17, 2024

BQA: Body Language Question Answering Dataset for Video Large Language Models

Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

arXiv:2410.13206v34.23 citations | h-index: 14 | ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of improving nonverbal communication understanding in AI for applications like human-computer interaction, though it is incremental as it focuses on dataset creation and evaluation.

The authors tackled the challenge of enabling Video Large Language Models (VideoLLMs) to interpret body language by proposing the BQA dataset, which covers 26 emotion labels across short video clips. They found that current models struggle with this task and that certain models give biased answers depending on the age group and ethnicity of the people in the video.

A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as unconscious human actions can easily cause a model to misinterpret a person's intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language covering 26 emotion labels. We evaluated various VideoLLMs on BQA and found that understanding body language is challenging. Our analysis of the VideoLLMs' wrong answers shows that certain VideoLLMs gave significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
