CV · AI · CL · LG · Feb 13, 2023

Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

arXiv:2302.10282v1267 citations · h-index: 25
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating and improving 3D language grounding in vision-language models, and is aimed at researchers in AI and computer vision. The contribution is incremental, since it builds on existing CLIP capabilities.

The paper investigates whether the CLIP model can ground perspective descriptions of 3D objects and identify canonical views. It finds that pre-trained CLIP performs poorly, but that fine-tuning with hard negative sampling and random contrasting improves results even with limited training data.

Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g. about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints and evaluate them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning using hard negative sampling and random contrasting yields good results even under conditions with little available training data.
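The evaluation idea described in the abstract (circle a camera around an object, then score each rendered view against a text query) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the radius, the number of views, and the two-dimensional toy embeddings are all assumptions made for illustration. In a real setup, `toy_embed_view` would be replaced by a CLIP image encoder applied to rendered images, and the text embedding would come from the CLIP text encoder.

```python
import math


def circle_camera_positions(n_views, radius=2.0, height=0.0):
    """Return (azimuth_deg, (x, y, z)) for cameras evenly spaced on a circle
    around an object assumed to sit at the origin."""
    positions = []
    for i in range(n_views):
        azimuth = 360.0 * i / n_views
        theta = math.radians(azimuth)
        xyz = (radius * math.cos(theta), radius * math.sin(theta), height)
        positions.append((azimuth, xyz))
    return positions


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_views(view_embeddings, text_embedding):
    """Rank viewpoints by similarity to the text query, best first.

    view_embeddings maps azimuth (degrees) to an embedding vector.
    """
    scores = [
        (azimuth, cosine_similarity(embedding, text_embedding))
        for azimuth, embedding in view_embeddings.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def toy_embed_view(azimuth_deg):
    """Hypothetical stand-in for an image encoder: maps a viewing angle
    to a 2D unit vector. A real pipeline would render and encode an image."""
    theta = math.radians(azimuth_deg)
    return [math.cos(theta), math.sin(theta)]


cameras = circle_camera_positions(n_views=8)
view_embeddings = {azimuth: toy_embed_view(azimuth) for azimuth, _ in cameras}

# Assumption: pretend the text query embeds to the same point as the 90-degree view.
text_embedding = toy_embed_view(90.0)

ranking = rank_views(view_embeddings, text_embedding)
best_azimuth, best_score = ranking[0]
print(best_azimuth, round(best_score, 3))
```

With real encoders, the top-ranked azimuth would be compared against the annotated canonical view for the query to measure how well the model grounds the description.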

