CV · AI · Jun 3

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

arXiv:2606.04806 · 68.6
Predicted impact top 45% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For developers of LLM/agentic systems deployed in social environments, NoRA shifts evaluation from action selection to grounded justification, revealing a measurable gap in current VLMs' normative competence.

NoRA introduces a visual first-person video benchmark for normative action reasoning, requiring models to generate candidate actions and justify each one with a fact-reason-action support graph. An evaluation of 12 multimodal systems shows that VLMs often recover plausible actions and relevant facts but struggle to construct the full reasonable action space and to bind actions to the correct local support.

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.
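To make the three evaluation components concrete, here is a minimal Python sketch of a fact-reason-action support graph and one possible way to score a prediction against a reference. Everything in it is an assumption made for illustration: the names `SupportEdge`, `SupportGraph`, and `grounded_reasonableness`, the exact-string matching, the set-level F1, and the unweighted mean used for aggregation are all hypothetical. The abstract does not specify the benchmark's schema, matching procedure, or aggregation formula.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SupportEdge:
    """One fact -> reason -> action link (hypothetical schema)."""
    fact: str
    reason: str
    action: str


@dataclass
class SupportGraph:
    edges: set[SupportEdge] = field(default_factory=set)

    def actions(self) -> set[str]:
        return {e.action for e in self.edges}

    def facts(self) -> set[str]:
        return {e.fact for e in self.edges}

    def bindings(self) -> set[tuple[str, str]]:
        # Which fact is used to support which action.
        return {(e.fact, e.action) for e in self.edges}


def f1(pred: set, gold: set) -> float:
    """Set-level F1 under exact matching (a simplifying assumption)."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def grounded_reasonableness(pred: SupportGraph, gold: SupportGraph) -> dict[str, float]:
    scores = {
        "action_alignment": f1(pred.actions(), gold.actions()),
        "factual_grounding": f1(pred.facts(), gold.facts()),
        "support_binding": f1(pred.bindings(), gold.bindings()),
    }
    # Assumed aggregation: unweighted mean of the three components.
    scores["overall"] = sum(scores.values()) / 3
    return scores


# Invented example: the reference graph has two reasonable actions.
gold = SupportGraph({
    SupportEdge("person behind is carrying boxes", "their hands are full", "hold the door"),
    SupportEdge("hallway is narrow", "two people cannot pass", "step aside"),
})
# The prediction names a valid action and a real fact, but pairs them wrongly.
pred = SupportGraph({
    SupportEdge("hallway is narrow", "it is polite", "hold the door"),
})

scores = grounded_reasonableness(pred, gold)
print(scores)
```

In this invented example the prediction earns partial credit for action alignment and factual grounding but zero for support binding, which is the kind of gap the abstract describes: a plausible action and a real scene fact, joined by the wrong support.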

