CV · Jul 10, 2025

EPIC: Efficient Prompt Interaction for Text-Image Classification

arXiv:2507.07415v1 · h-index: 9 · ICME
Originality: Incremental advance
AI Analysis

This work addresses the computational inefficiency that researchers and practitioners face when fine-tuning large multimodal models. It is an incremental advance, since it builds on existing prompt-based methods.

The paper tackles the high computational cost of fine-tuning large multimodal models for text-image classification by proposing EPIC, an efficient prompt-based interaction strategy that reduces trainable parameters to about 1% of the foundation model while achieving superior performance on datasets like UPMC-Food101 and SNLI-VE.

In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models on downstream tasks. Hence, prompt-based interaction strategies have been studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers and integrate different modalities with similarity-based prompt interaction, enabling sufficient information exchange between modalities. With this approach, our method consumes fewer computational resources and requires fewer trainable parameters (about 1% of the foundation model) than other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.
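As a rough illustration of what "similarity-based prompt interaction" could look like, the sketch below mixes each prompt vector of one modality with a similarity-weighted average of the other modality's prompts. This is not the paper's formulation: the function names (`cosine`, `softmax`, `interact`), the softmax weighting, the mixing coefficient `alpha`, and the toy 2-dimensional prompts are all assumptions made for illustration, and the prompts of both modalities are assumed to already share one embedding dimension.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)  # small epsilon guards against zero vectors

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def interact(own, other, alpha=0.5):
    """Hypothetical interaction step (an assumption, not the paper's method).

    Each prompt in `own` is blended with a similarity-weighted average of the
    prompts in `other`; `alpha` controls how much cross-modal signal is mixed in.
    """
    updated = []
    for p in own:
        weights = softmax([cosine(p, q) for q in other])
        mixed = [sum(w * q[i] for w, q in zip(weights, other)) for i in range(len(p))]
        updated.append([(1 - alpha) * a + alpha * b for a, b in zip(p, mixed)])
    return updated

# Toy prompts: two text prompts and two image prompts in a shared 2-D space.
text_prompts = [[1.0, 0.0], [0.0, 1.0]]
image_prompts = [[1.0, 0.0], [0.5, 0.5]]

new_text = interact(text_prompts, image_prompts)   # text prompts informed by image prompts
new_image = interact(image_prompts, text_prompts)  # image prompts informed by text prompts
```

In a setup like the one the abstract describes, only the prompt vectors (and any small interaction parameters) would be trained while the foundation model stays frozen, which is how the trainable-parameter count can stay near 1% of the full model.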

