CV · Jan 30, 2025

Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations

arXiv:2501.18474v1 · 3 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses the costly and time-intensive annotation problem in medical imaging. The advance is incremental because it builds on existing foundation models and test-time training concepts.

The paper tackles the performance gap between vision foundation models and task-specific models in segmentation. It introduces a test-time training method that uses point prompts and augmentations to improve performance without full annotations, and reports an average Dice coefficient of 0.868 across 12 anatomies on a new VFSS-5k dataset.

Vision foundation models have demonstrated exceptional generalization capabilities in segmentation tasks for both generic and specialized images. However, a performance gap persists between foundation models and task-specific, specialized models. Fine-tuning foundation models on downstream datasets is often necessary to bridge this gap. Unfortunately, obtaining fully annotated ground truth for downstream datasets is both challenging and costly. To address this limitation, we propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. The model learns by resolving the ambiguity of the point prompt through various augmentations. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive. We conducted extensive experiments on our new Videofluoroscopy dataset (VFSS-5k) for the instance segmentation task, achieving an average Dice coefficient of 0.868 across 12 anatomies with a single model.
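The headline metric above, the Dice coefficient, measures overlap between a predicted mask and a reference mask: twice the intersection divided by the sum of the two mask areas. The sketch below is a minimal, illustrative implementation and is not taken from the paper. The function names, the toy masks, the empty-mask convention, and the unweighted mean over anatomies are all assumptions, since the abstract does not say how the average is computed.

```python
from typing import Iterable, Sequence, Tuple

# 2D binary mask: 1 = anatomy pixel, 0 = background (illustrative representation).
Mask = Sequence[Sequence[int]]


def dice_coefficient(pred: Mask, target: Mask) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks of equal shape."""
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    if pred_area + target_area == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / (pred_area + target_area)


def mean_dice(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean Dice over (prediction, reference) pairs, e.g. one per anatomy."""
    scores = [dice_coefficient(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


# Toy masks, not data from the paper.
pred_a = [[1, 1, 0], [0, 1, 0]]
ref_a = [[1, 0, 0], [0, 1, 1]]
pred_b = [[0, 1], [1, 1]]
ref_b = [[0, 1], [1, 1]]

score_a = dice_coefficient(pred_a, ref_a)  # overlap 2, areas 3 and 3
score_b = dice_coefficient(pred_b, ref_b)  # identical masks
average = mean_dice([(pred_a, ref_a), (pred_b, ref_b)])
```

A score of 1.0 means the masks coincide exactly and 0.0 means they do not overlap, so an average of 0.868 indicates substantial but imperfect overlap across the evaluated anatomies.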
