CV | LG | Oct 30, 2025

Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition

arXiv:2510.26466v2 | h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses debiased and reliable multimodal reasoning in vision-language models and offers a practical causal approach, though the advance is incremental because it builds on existing CLIP frameworks.

The paper tackles the problem of object-context shortcuts in vision-language models that reduce zero-shot reliability by recasting it as a causal inference problem and synthesizing counterfactual embeddings to mitigate biases. The method improves worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art without retraining or prompt design.

Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
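
To make the recombination and subtraction steps described in the abstract more concrete, here is a minimal NumPy sketch on toy vectors. It is not the authors' implementation: the additive recombination of object and background features, the averaging over alternative contexts, the weight `alpha`, and the function names `l2_normalize` and `counterfactual_scores` are all assumptions made for illustration. Real use would replace the toy vectors with CLIP image and text embeddings and with whatever object/background estimator the paper defines.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def counterfactual_scores(obj_feat, bg_feat, alt_bg_feats, text_feats, alpha=1.0):
    """Hypothetical debiased class scores (illustrative assumption, not the paper's formula).

    obj_feat:     (d,)   estimated object feature
    bg_feat:      (d,)   estimated background feature of the test image
    alt_bg_feats: (k, d) alternative context features
    text_feats:   (c, d) unit-norm class text embeddings
    """
    # Recombine the object with each alternative context (assumed: simple addition).
    cf_embeds = l2_normalize(obj_feat[None, :] + alt_bg_feats)   # (k, d)
    # Average class similarity over the counterfactual scenes.
    cf_logits = (cf_embeds @ text_feats.T).mean(axis=0)          # (c,)
    # Background-only activation: what the context alone would predict.
    bg_logits = l2_normalize(bg_feat) @ text_feats.T             # (c,)
    # Subtract the background-only term, in the spirit of a Total Direct Effect.
    return cf_logits - alpha * bg_logits

# Toy setup: class 0 is the true object, class 1 is favored by the background.
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
obj_feat = np.array([1.0, 0.0, 0.0])
bg_feat = np.array([0.0, 1.5, 0.0])            # spurious context dominates
alt_bg_feats = np.array([[0.0, 0.0, 1.0],
                         [0.0, 0.2, 1.0]])     # contexts unrelated to class 1

factual_embed = l2_normalize(obj_feat + bg_feat)
factual_pred = int(np.argmax(factual_embed @ text_feats.T))

scores = counterfactual_scores(obj_feat, bg_feat, alt_bg_feats, text_feats)
debiased_pred = int(np.argmax(scores))

print("factual prediction:", factual_pred)
print("debiased prediction:", debiased_pred)
```

In this toy case the factual embedding is pulled toward the class associated with the background, while the counterfactual average combined with the background-only subtraction restores the object's class. The sketch only illustrates the general shape of the idea; how object and background expectations are actually estimated in CLIP's representation space is not specified in the text above.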

