CV | Mar 25, 2025

fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

Saurav Sharma, Didier Mutter, Nicolas Padoy

arXiv:2503.19670v1 | 1 citation | h-index: 32

Originality: Incremental advance

AI Analysis

This work addresses fine-grained surgical activity recognition for medical AI applications. The advance is incremental: it builds on existing CLIP models and adds domain-specific adaptations.

The paper tackles zero-shot fine-grained surgical action recognition, where vision-language models such as CLIP struggle with action triplets because they rely on global image features and do not model the hierarchy within a triplet. It reports significant improvements in F1 and mAP on the CholecT50 dataset, improving recognition of novel surgical triplets.

While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
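The two base-to-novel settings described in the abstract can be illustrated with a minimal sketch. It assumes a triplet is an (instrument, verb, target) tuple and that a split is defined by holding out a set of targets or a set of instrument-verb pairs. The function names, the `Triplet` type, and the example labels are hypothetical and chosen for illustration; they are not the paper's code or its actual CholecT50 split.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One surgical action triplet (hypothetical container for this sketch)."""
    instrument: str
    verb: str
    target: str


def split_unseen_target(
    triplets: Iterable[Triplet], held_out_targets: Iterable[str]
) -> Tuple[List[Triplet], List[Triplet]]:
    """Unseen-Target style split: a triplet is novel if its target is held out."""
    held = set(held_out_targets)
    items = list(triplets)
    base = [t for t in items if t.target not in held]
    novel = [t for t in items if t.target in held]
    return base, novel


def split_unseen_instrument_verb(
    triplets: Iterable[Triplet], held_out_pairs: Iterable[Tuple[str, str]]
) -> Tuple[List[Triplet], List[Triplet]]:
    """Unseen-Instrument-Verb style split: a triplet is novel if its
    (instrument, verb) pair is held out."""
    held = set(held_out_pairs)
    items = list(triplets)
    base = [t for t in items if (t.instrument, t.verb) not in held]
    novel = [t for t in items if (t.instrument, t.verb) in held]
    return base, novel


# Illustrative labels only; the held-out sets below are assumptions.
TRIPLETS = [
    Triplet("grasper", "retract", "gallbladder"),
    Triplet("grasper", "retract", "liver"),
    Triplet("hook", "dissect", "gallbladder"),
    Triplet("clipper", "clip", "cystic_duct"),
]

base_t, novel_t = split_unseen_target(TRIPLETS, {"liver"})
base_iv, novel_iv = split_unseen_instrument_verb(TRIPLETS, {("hook", "dissect")})
```

In this sketch a model would be trained only on the base triplets and evaluated on the novel ones, so the Unseen-Target split probes generalization to new anatomical structures while the Unseen-Instrument-Verb split probes generalization to new instrument-verb interactions.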
