CV · Oct 15, 2025

VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das

arXiv:2510.13808v1 · 3.6 · h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses domain adaptation for VLMs, improving performance in novel domains such as robotics and depth sensing. The advance is incremental because it builds on existing probing techniques.

The paper tackles the problem of large Vision-Language Models (VLMs) degrading in performance under domain shifts by introducing VisCoP, which adds learnable visual probes to the vision encoder for efficient adaptation. Results show it outperforms existing methods across cross-view, cross-modal, and cross-task settings while retaining source-domain knowledge.

Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
