CV · LG · Mar 11, 2023

Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models

arXiv:2303.12748v4 · 18 citations · h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses calibration issues that affect users of large vision-language models, but the advance is incremental: it adapts existing calibration methods to a new context.

The paper tackles miscalibration in the zero-shot inference of vision-language models such as CLIP, and finds that a modified temperature scaling method with a single learned temperature generalizes across datasets and prompts.

Calibration of deep learning models is crucial to their trustworthiness and safe usage, and as such, has been extensively studied in supervised classification models, with methods crafted to decrease miscalibration. However, there has yet to be a comprehensive study of the calibration of vision-language models that are used for zero-shot inference, like CLIP. We measure calibration across relevant variables like prompt, dataset, and architecture, and find that zero-shot inference with CLIP is miscalibrated. Furthermore, we propose a modified version of temperature scaling that is aligned with the common use cases of CLIP as a zero-shot inference model, and show that a single learned temperature generalizes for each specific CLIP model (defined by a chosen pre-training dataset and architecture) across inference dataset and prompt choice.
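To make the idea concrete, the following is a minimal sketch of standard temperature scaling together with an expected calibration error (ECE) estimate. It is not the paper's modified procedure, which the abstract does not spell out. It assumes the zero-shot logits (for example, image-text similarity scores per class prompt) are already available as a NumPy array; the function names, the grid-search fit, and the choice of 10 bins are illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    probs = softmax(logits, temperature)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(true_class_probs + 1e-12))


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted mean gap between accuracy and confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # mask.mean() is the fraction of samples falling in this bin
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece


def fit_temperature(logits, labels, grid=None):
    """Pick the scalar temperature that minimizes NLL on held-out data.

    A simple grid search stands in for gradient-based optimization here.
    """
    if grid is None:
        grid = np.logspace(-2, 2, 200)  # assumed search range: 0.01 to 100
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

In this sketch, the temperature would be fitted once on a labelled held-out set and then reused unchanged on other datasets and prompts, comparing `expected_calibration_error(softmax(logits), labels)` with `expected_calibration_error(softmax(logits, t), labels)` to see whether calibration improves.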

