CV | Mar 28, 2024

Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition

arXiv:2403.19786v2 | 5 citations | h-index: 31 | Int J Comput Assist Radiol Surg
Originality: Incremental advance
AI Analysis

This work offers a solution for surgical robotics by enabling gesture recognition without large annotated datasets, though it is incremental in that it adapts existing methods to a specific domain.

The paper tackles the challenge of surgical gesture recognition across diverse procedures by developing a zero-shot prompt-based video encoder, which outperforms standard encoders and shows strong performance in zero-shot scenarios without task-specific retraining.

Purpose: To produce a surgical gesture recognition system that can support a wide variety of procedures, either a very large annotated dataset must be acquired, or fitted models must generalize to new labels (so-called "zero-shot" capability). In this paper we investigate the feasibility of the latter option.

Methods: Leveraging the Bridge-Prompt framework, we prompt-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This can utilize extensive outside video data such as text, but also make use of label meta-data and weakly supervised contrastive losses.

Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature extractor training schema.

Conclusion: Bridge-Prompt and similar pre-trained+prompt-tuned video encoder models provide significant visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to zero-shot transfer without the need for any task (gesture) specific retraining makes them invaluable.
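The zero-shot idea in the abstract can be illustrated with a minimal sketch. It is not the paper's or Bridge-Prompt's implementation. It assumes only that a vision-text model maps a video clip and each gesture description into a shared embedding space, so that a clip can be assigned to the closest description by cosine similarity. The function names, the gesture descriptions, and the 3-dimensional toy vectors below are all hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(clip_emb, text_embs):
    """Cosine similarity between one clip embedding and each label embedding."""
    v = l2_normalize(clip_emb)
    return {
        label: sum(a * b for a, b in zip(v, l2_normalize(t)))
        for label, t in text_embs.items()
    }


def zero_shot_predict(clip_emb, text_embs):
    """Return the label whose text embedding is most similar to the clip."""
    scores = cosine_scores(clip_emb, text_embs)
    return max(scores, key=scores.get)


# Hypothetical toy embeddings; a real system would obtain these from the
# prompt-tuned video encoder and the text encoder.
text_embs = {
    "reaching for needle": [1.0, 0.0, 0.0],
    "pushing needle through tissue": [0.0, 1.0, 0.0],
}
clip_emb = [0.1, 0.9, 0.2]
print(zero_shot_predict(clip_emb, text_embs))

# A gesture never seen during encoder training can be added at prediction
# time simply by embedding its text description; no retraining is needed.
text_embs["pulling suture"] = [0.0, 0.0, 1.0]
unseen_clip_emb = [0.1, 0.2, 0.95]
print(zero_shot_predict(unseen_clip_emb, text_embs))
```

In this sketch the label set is just a dictionary, so extending it to a new gesture only requires a new text embedding. That is the property the abstract refers to as zero-shot transfer; how well it works in practice depends on the quality of the learned embedding space.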

Code Implementations: 1 repo
