CV Feb 6

CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

arXiv:2602.06619v1 1.5 h-index: 7

Originality: Incremental advance

AI Analysis

The paper addresses domain adaptation for surgical video understanding, a problem that matters for intelligent operating rooms. The contribution appears incremental: it builds on CLIP with causality-inspired modifications.

The paper tackles surgical phase recognition in videos, where limited annotated clinical data and the domain gap between synthetic and real data hinder robust training. The authors report that CauCLIP substantially outperforms all competing approaches on the SurgVisDom hard adaptation benchmark.

Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
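
The abstract does not say how the frequency-based augmentation is implemented. One common way to perturb domain-specific appearance while keeping semantic structure is Fourier amplitude mixing: blend the low-frequency amplitude spectrum of a frame with that of a reference frame and keep the original phase. The sketch below illustrates that general technique only. It is an assumption, not the authors' method, and the function name `amplitude_mix` and the parameters `alpha` and `beta` are hypothetical.

```python
import numpy as np


def amplitude_mix(image, reference, alpha=0.5, beta=0.1):
    """Blend low-frequency amplitude of `image` with `reference`, keeping the phase of `image`.

    Hypothetical illustration of a frequency-based augmentation; not taken from the paper.
    image, reference: float arrays of identical shape (H, W, C).
    alpha: blend weight in [0, 1]; 0 returns the input unchanged.
    beta: half-width of the low-frequency window as a fraction of H and W (below 0.5).
    """
    # 2D FFT over the spatial axes, computed per channel.
    fft_img = np.fft.fft2(image, axes=(0, 1))
    fft_ref = np.fft.fft2(reference, axes=(0, 1))

    # Amplitude mostly carries style (illumination, texture statistics);
    # phase mostly carries spatial structure.
    amp_img = np.fft.fftshift(np.abs(fft_img), axes=(0, 1))
    amp_ref = np.fft.fftshift(np.abs(fft_ref), axes=(0, 1))
    phase_img = np.angle(fft_img)

    # After fftshift the zero frequency sits at (H // 2, W // 2).
    h, w = image.shape[:2]
    ch, cw = h // 2, w // 2
    bh, bw = int(h * beta), int(w * beta)

    # Blend only a window centered on the zero frequency.
    rows = slice(ch - bh, ch + bh + 1)
    cols = slice(cw - bw, cw + bw + 1)
    mixed = amp_img.copy()
    mixed[rows, cols] = (1 - alpha) * amp_img[rows, cols] + alpha * amp_ref[rows, cols]
    mixed = np.fft.ifftshift(mixed, axes=(0, 1))

    # Recombine the mixed amplitude with the original phase and invert.
    out = np.fft.ifft2(mixed * np.exp(1j * phase_img), axes=(0, 1))
    return np.real(out)


# Usage with random stand-ins for a synthetic frame and a reference frame.
rng = np.random.default_rng(0)
frame = rng.random((32, 32, 3))
style_ref = rng.random((32, 32, 3))
augmented = amplitude_mix(frame, style_ref, alpha=1.0, beta=0.1)
```

Because only the amplitude inside the low-frequency window changes, edges and object layout (carried largely by phase) are preserved, which is the property the abstract attributes to its augmentation. The causal suppression loss is not sketched here, since the abstract gives no information about its form.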
