CV · Sep 9, 2024

Evaluating Multiview Object Consistency in Humans and Image Models

Berkeley
arXiv:2409.05862v2 · 16 citations · h-index: 111
Originality: Synthesis-oriented
AI Analysis

This work addresses the gap in benchmarking human-model consistency in vision for researchers in cognitive science and AI, though it is incremental as it applies existing methods to new data.

The authors evaluate how well vision models align with human perception on a 3D shape inference task. Humans outperform all tested models (e.g., DINOv2, MAE, CLIP) by a wide margin; human and model performance is correlated, but humans allocate more time to challenging trials.

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
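The abstract does not say how a model's choice is read out on a same/different trial. As a rough illustration only, one common zero-shot readout is to embed each image, compare embeddings by cosine similarity, and pick the image least similar to the rest as the "different" object. The sketch below assumes that readout; the function names (`pick_odd_one_out`, `accuracy`) and the toy two-dimensional embeddings are hypothetical and do not come from the paper or its code release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_odd_one_out(embeddings):
    """Return the index of the image least similar, on average, to the others.

    Assumption: the 'different' object is the one whose embedding has the
    lowest mean cosine similarity to the remaining images in the trial.
    """
    n = len(embeddings)
    mean_sims = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        mean_sims.append(sum(sims) / len(sims))
    return min(range(n), key=mean_sims.__getitem__)


def accuracy(trials):
    """Fraction of trials where the predicted odd image matches the label."""
    correct = sum(pick_odd_one_out(embs) == odd for embs, odd in trials)
    return correct / len(trials)


# Toy trials: (list of per-image embeddings, index of the different object).
# In practice the embeddings would come from a vision model such as DINOv2.
toy_trials = [
    ([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], 2),
    ([[0.0, 1.0], [1.0, 0.2], [1.0, 0.0]], 0),
]
print(accuracy(toy_trials))
```

Per-trial correctness from a readout like this could then be compared against human accuracy or reaction time on the same trials, which is the kind of human-model comparison the abstract describes.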

Code Implementations: 1 repo