CV · IV · May 19, 2025

ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

arXiv:2505.13746v1
Originality: Synthesis-oriented
AI Analysis

This work addresses surgical workflow analysis for medical applications, but it is incremental as it adapts existing vision-language models to a specific domain.

The authors tackled the problem of surgical phase recognition from video by proposing a representation learning method using a vision-language model, which outperformed conventional methods on three datasets.

Surgical phase recognition from video automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition have focused primarily on Transformer-based methods, although two-stage methods have also shown high performance: a CNN extracts spatial features from individual frames, and a time series model then extracts video features from the resulting sequence of spatial features. However, there remains a paucity of research on how to train the CNNs used for feature extraction, that is, on representation learning for surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). The proposed method fine-tunes the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. Experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
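The abstract does not spell out the training objective, so the following is only a minimal sketch of the generic CLIP-style scoring that such fine-tuning would plausibly build on, not the authors' implementation. It assumes one text embedding per surgical phase (in practice produced by the text encoder from a prompt) and one image embedding per frame, and scores the frame against each phase by temperature-scaled cosine similarity. The function names, the toy 3-dimensional vectors, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def phase_probabilities(image_feat, phase_text_feats, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (CLIP-style scoring).

    image_feat: embedding of one video frame (hypothetical image-encoder output).
    phase_text_feats: one embedding per phase (hypothetical text-encoder output).
    """
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in phase_text_feats:
        txt = l2_normalize(text_feat)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true phase; a typical fine-tuning loss."""
    return -math.log(probs[label])


# Toy data: three phases and one frame whose embedding is closest to phase 0.
phase_text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feat = [0.9, 0.1, 0.0]

probs = phase_probabilities(image_feat, phase_text_feats)
predicted_phase = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, 0)
```

In an actual fine-tuning setup, the gradient of a loss like this would update the image encoder (and any learnable prompt parameters). The encoder would then serve as the frame-level feature extractor for a separate temporal model.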
