CV · Aug 28, 2025

Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

Krit Duangprom, Tryphon Lambrou, Binod Bhattarai

arXiv:2508.20830v1 · 8.42 citations · h-index: 17 · DEMI@MICCAI

Originality: Incremental advance

AI Analysis

This work addresses overfitting on small-scale medical datasets in surgical tool keypoint detection. The advance is incremental, since it adapts existing VLMs to a specific domain.

This paper tackles the problem of 2D keypoint estimation for surgical tools by fine-tuning Vision-Language Models with Low-Rank Adaptation, achieving improved performance over baseline models with only two epochs of fine-tuning in low-resource medical datasets.

This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision-Language Models (VLMs) fine-tuned using Low-Rank Adaptation (LoRA). Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D pose estimation of surgical hands and tools.
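
As a rough illustration of the low-rank idea behind LoRA, and not the authors' implementation, the sketch below adds a trainable rank-r update to a frozen weight matrix using NumPy. The layer sizes, the rank, the scaling factor `alpha`, and the helper name `lora_forward` are all assumptions chosen for illustration; the abstract does not state the paper's actual settings.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer plus a scaled low-rank update: x W^T + (alpha/r) x A^T B^T."""
    r = A.shape[0]  # rank of the update
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4                    # hypothetical layer sizes and rank

W = rng.normal(size=(d_out, d_in))            # pre-trained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection
B = np.zeros((d_out, r))                      # trainable up-projection, zero-initialized

x = rng.normal(size=(2, d_in))                # a batch of two input vectors
y = lora_forward(x, W, A, B, alpha=8)

full_params = W.size                          # parameters updated by full fine-tuning
lora_params = A.size + B.size                 # parameters updated by the low-rank update
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` (384 values here, against 2048 in `W`) would be trained. Training far fewer parameters is one reason such adapters are used on small datasets.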
