CV | Mar 12, 2024

Fine-grained Prompt Tuning: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification

Yijin Huang, Pujin Cheng, Roger Tam, Xiaoying Tang

arXiv:2403.07576v411.310 citations | h-index: 16 | Has Code | MICCAI

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying large pre-trained models in resource-constrained medical imaging settings, though it is an incremental improvement over existing PETL methods.

The paper tackles the problem of high memory consumption in parameter-efficient transfer learning for medical image classification by proposing Fine-grained Prompt Tuning (FPT), which achieves comparable performance to full fine-tuning while using only 1.8% of learnable parameters and 13% of memory costs on a ViT-B model with 512x512 input resolution.

Parameter-efficient transfer learning (PETL) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PETL method for medical image classification. FPT significantly reduces memory consumption compared to other PETL methods, especially in high-resolution input contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important tokens selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves comparable performance to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory costs of an encoder ViT-B model with a 512 x 512 input resolution.
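To make the memory argument in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It counts the patch tokens a ViT would produce at two resolutions and shows one plausible form of important-token selection. The patch size of 16, the 224 x 224 low-resolution input, the `keep_ratio` parameter, and the per-token `scores` are all assumptions made for illustration; the abstract does not specify them.

```python
import heapq


def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (CLS token not counted).

    patch_size=16 is an assumption (a common ViT-B configuration).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def select_important_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance value (for example,
    the attention a patch receives from the CLS token). The selection
    criterion used by FPT is not stated in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, int(len(tokens) * keep_ratio))
    top = heapq.nlargest(k, range(len(tokens)), key=lambda i: scores[i])
    return [tokens[i] for i in sorted(top)]


high_res = num_patch_tokens(512)  # tokens processed by the frozen LPM
low_res = num_patch_tokens(224)   # tokens processed by the side network (assumed size)
print(high_res, low_res)          # 1024 196

kept = select_important_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 0.5)
print(kept)                       # ['b', 'd']
```

Under these assumptions, the high-resolution branch handles 1024 tokens against 196 for the low-resolution branch. Because self-attention compares every token with every other token, the number of pairwise interactions differs by a factor of about 27, which suggests why keeping gradients out of the high-resolution branch and pruning its tokens can reduce training memory.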

