CV · AI · CL · Feb 27, 2025

EgoNormia: Benchmarking Physical Social Norm Understanding

Georgia Tech
arXiv:2502.20490v5 · 9 citations · h-index: 15 · ACL
Originality: Incremental advance
AI Analysis

This work addresses how to evaluate and improve physical social norm understanding in vision-language models, which matters for deploying safe and effective AI agents that interact with people. The advance is incremental: it builds on existing dataset and method frameworks.

The paper tackles sparse supervision for normative reasoning in vision-language models by introducing EGONORMIA, a dataset of 1,853 multiple-choice questions grounded in egocentric videos. Current state-of-the-art models score at most 54% on it, which indicates risks in real-world applications.

Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically- or socially-grounded. We thus present EGONORMIA $\|ε\|$, comprising 1,853 (200 for EGONORMIA-verified) multiple choice questions (MCQs) grounded within egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EGONORMIA spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EGONORMIA and 65% on EGONORMIA-verified, with performance across norm categories indicating significant risks of safety and privacy when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EGONORMIA can enhance normative reasoning in VLMs.
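The abstract reports both an overall score and performance broken down by norm category. A minimal sketch of how such scores could be computed for a multiple-choice benchmark follows. The record layout (the keys `id`, `answer`, and `categories`) and the function name `accuracy_by_category` are hypothetical, chosen for illustration; they are not taken from the EGONORMIA release or its evaluation code.

```python
from collections import defaultdict

# The seven norm categories listed in the abstract.
NORM_CATEGORIES = [
    "safety", "privacy", "proxemics", "politeness",
    "cooperation", "coordination/proactivity", "communication/legibility",
]


def accuracy_by_category(items, predictions):
    """Return (overall_accuracy, per_category_accuracy).

    items: list of dicts with hypothetical keys
        'id' (str), 'answer' (int index of the correct choice),
        'categories' (list of norm category names).
    predictions: dict mapping question id -> predicted choice index.
    A question with no prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    n_correct = 0
    for item in items:
        hit = predictions.get(item["id"]) == item["answer"]
        n_correct += int(hit)
        # A question may carry more than one category (an assumption here).
        for cat in item["categories"]:
            total[cat] += 1
            correct[cat] += int(hit)
    overall = n_correct / len(items) if items else 0.0
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    return overall, per_category


if __name__ == "__main__":
    # Toy records for illustration only; not real benchmark data.
    toy_items = [
        {"id": "q1", "answer": 0, "categories": ["safety"]},
        {"id": "q2", "answer": 2, "categories": ["safety", "privacy"]},
    ]
    toy_predictions = {"q1": 0, "q2": 1}
    print(accuracy_by_category(toy_items, toy_predictions))
```

Reporting per-category accuracy alongside the overall figure is what makes it possible to see weaknesses in specific categories, such as safety or privacy, that a single aggregate number would hide.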

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
