CV · Oct 10, 2023

A General Protocol to Probe Large Vision Models for 3D Physical Understanding

arXiv:2310.06836v3 · 29 citations · h-index: 50
Originality: Incremental advance
AI Analysis

This work gives researchers and practitioners a way to evaluate 3D physical understanding in vision models. It is rated incremental because it builds on existing probing methods.

The paper assesses how well large vision models understand 3D physical properties from images. It introduces a protocol that probes model features for properties such as geometry and lighting, and finds that Stable Diffusion and DINOv2 features perform best, outperforming DINOv1, CLIP and VQGAN on all properties.

Our objective in this paper is to probe large vision models to determine to what extent they 'understand' different physical properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a general and lightweight protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene, by training discriminative classifiers on the features for these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures, and large vision models including CLIP, DINOv1, DINOv2, VQGAN, Stable Diffusion. (iii) We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties, including scene geometry, support relations, shadows and depth, but less performant for occlusion and material, while outperforming DINOv1, CLIP and VQGAN for all properties. (iv) It is observed that different time steps of Stable Diffusion features, as well as different transformer layers of DINO/CLIP/VQGAN, are good at different properties, unlocking potential applications of 3D physical understanding.
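The probing idea in contributions (i) and (iv) can be illustrated with a minimal sketch: freeze the features, fit a small discriminative classifier on top, and compare how well the property can be read out from different layers or time steps. Everything below is an assumption made for illustration. The features are synthetic stand-ins rather than outputs of Stable Diffusion or DINOv2, the probe is a plain logistic regression rather than whatever classifier the authors train, and the names `train_linear_probe`, `probe_accuracy`, `make_synthetic_layer`, and `sweep_layers` are hypothetical.

```python
import numpy as np


def train_linear_probe(feats, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe on frozen features.

    Only the probe weights are learned; the features are treated as fixed,
    which mimics probing an off-the-shelf model without fine-tuning it.
    """
    n, d = feats.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = feats @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels  # gradient of the logistic loss w.r.t. logits
        w -= lr * (feats.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def probe_accuracy(w, b, feats, labels):
    """Fraction of examples whose binary property label is predicted correctly."""
    preds = (feats @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


def make_synthetic_layer(rng, n, d, signal):
    """Hypothetical stand-in for features taken from one layer or time step.

    `signal` controls how linearly decodable the binary property is:
    0.0 means the features carry no information about the label.
    """
    labels = rng.integers(0, 2, size=n).astype(float)
    feats = rng.normal(size=(n, d))
    feats[:, 0] += signal * (2.0 * labels - 1.0)
    return feats, labels


def sweep_layers(layers, train_fraction=0.67):
    """Train one probe per layer and report held-out accuracy for each."""
    scores = {}
    for name, (feats, labels) in layers.items():
        split = int(len(labels) * train_fraction)
        w, b = train_linear_probe(feats[:split], labels[:split])
        scores[name] = probe_accuracy(w, b, feats[split:], labels[split:])
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = {
        "layer_a": make_synthetic_layer(rng, n=600, d=16, signal=0.0),
        "layer_b": make_synthetic_layer(rng, n=600, d=16, signal=3.0),
    }
    best, scores = sweep_layers(layers)
    print("best layer:", best)
    print("held-out accuracy per layer:", scores)
```

In an actual probing setup, `make_synthetic_layer` would be replaced by features extracted from a chosen layer or time step of the frozen model, paired with the dataset's annotations for the property of interest. The sweep then indicates which layer or time step exposes that property most readily to a simple classifier.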

Code Implementations: 1 repo