CV AI CL Feb 27, 2025

EgoNormia: Benchmarking Physical Social Norm Understanding

MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang

Georgia Tech

arXiv:2502.20490v516.412 citations h-index: 15 Has Code ACL

Originality Incremental advance

AI Analysis

This work addresses the challenge of evaluating and improving physical social norm understanding in vision-language models, a capability that matters for deploying safe and effective AI agents around people. It is rated incremental because it builds on existing dataset and method frameworks.

The paper tackles the problem of sparse supervision for normative reasoning in vision-language models by introducing EGONORMIA, a dataset of 1,853 multiple-choice questions grounded in egocentric videos. It finds that current state-of-the-art models score at most 54% on it, indicating safety and privacy risks when such models are used in real-world agents.

Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically- or socially-grounded. We thus present EGONORMIA $\|\epsilon\|$, comprising 1,853 (200 for EGONORMIA-verified) multiple-choice questions (MCQs) grounded within egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified, with performance across norm categories indicating significant risks of safety and privacy when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EGONORMIA can enhance normative reasoning in VLMs.
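
The headline numbers above are multiple-choice accuracies, reported overall and broken down by norm category. As a rough illustration of that kind of scoring, here is a minimal sketch. The `MCQResult` record and the one-category-per-question simplification are assumptions made for this example, not the dataset's actual schema or the authors' evaluation code, and the toy predictions are made up.

```python
from collections import defaultdict
from dataclasses import dataclass

# The seven norm categories named in the abstract.
CATEGORIES = (
    "safety", "privacy", "proxemics", "politeness",
    "cooperation", "coordination/proactivity", "communication/legibility",
)


@dataclass
class MCQResult:
    """One scored question (hypothetical layout, not the dataset's schema)."""
    category: str   # assumed to be exactly one of CATEGORIES
    predicted: int  # index of the option the model chose
    correct: int    # index of the gold option


def accuracy_by_category(results):
    """Return (overall_accuracy, {category: accuracy}) for a list of MCQResult."""
    if not results:
        raise ValueError("no results to score")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        hits[r.category] += int(r.predicted == r.correct)
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / len(results)
    return overall, per_category


# Toy, made-up predictions purely to exercise the function.
toy = [
    MCQResult("safety", 0, 0),
    MCQResult("safety", 1, 2),
    MCQResult("privacy", 3, 3),
    MCQResult("proxemics", 2, 1),
]
overall, per_category = accuracy_by_category(toy)
```

Reporting the per-category breakdown alongside the overall score is what makes it possible to see a weakness in one category, such as safety or privacy, that an aggregate figure would hide.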
