CV · Dec 17, 2023

p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models

Haoyuan Wu, Xinyun Zhang, Peng Xu, Peiyu Liao, Xufeng Yao, Bei Yu

arXiv:2312.10613v11.51 citations · h-index: 8 · Has Code · AAAI

Originality: Highly original

AI Analysis

This work addresses the efficient adaptation of large vision-language models, a problem relevant to both researchers and practitioners. It offers a novel method that incrementally improves adapter tuning by accounting for the heterophilic graphs that arise in attention.

The paper tackles the efficient adaptation of large pre-trained vision-language models by proposing a new adapter architecture, $p$-adapter, which uses $p$-Laplacian message passing to handle the heterophilic graphs that arise in attention mechanisms. It reports significant gains over other parameter-efficient transfer learning methods on visual question answering, visual entailment, and image captioning.

Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks. In light of the rapidly increasing size of pre-trained VLMs, parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning. One such approach is the adapter, which introduces a few trainable parameters into the pre-trained models while preserving the original parameters during adaptation. In this paper, we present a novel modeling framework that recasts adapter tuning after attention as a graph message passing process on attention graphs, where the projected query and value features and attention matrix constitute the node features and the graph adjacency matrix, respectively. Within this framework, tuning adapters in VLMs necessitates handling heterophilic graphs, owing to the disparity between the projected query and value space. To address this challenge, we propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs). Specifically, the attention weights are re-normalized based on the features, and the features are then aggregated using the calibrated attention matrix, enabling the dynamic exploitation of information with varying frequencies in the heterophilic attention graphs. We conduct extensive experiments on different pre-trained VLMs and multi-modal tasks, including visual question answering, visual entailment, and image captioning. The experimental results validate our method's significant superiority over other PETL methods.
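The re-normalize-then-aggregate step described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the paper's exact formulation: the function name `p_laplacian_aggregate`, the distance-based re-weighting rule, and the `eps` smoothing term are all hypothetical choices made here for clarity. It treats the attention matrix as a bipartite adjacency between query nodes and value nodes, rescales each edge by the feature distance raised to the power $p-2$, and then aggregates value features with the re-normalized weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def p_laplacian_aggregate(attn, q, v, p=2.0, eps=1e-6):
    """Illustrative p-Laplacian-style aggregation (hypothetical helper).

    attn: (n_q, n_v) attention matrix, rows summing to 1 (graph adjacency)
    q:    (n_q, d) projected query features (one node set)
    v:    (n_v, d) projected value features (the other node set)
    p:    exponent; p = 2 leaves the attention weights unchanged
    """
    # Pairwise feature distance between query i and value j: (n_q, n_v).
    diff = q[:, None, :] - v[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + eps)

    # Re-weight each edge by distance^(p - 2). For p < 2 the exponent is
    # negative, so pairs with similar features gain weight; for p > 2,
    # pairs with dissimilar features gain weight.
    weights = attn * dist ** (p - 2.0)

    # Re-normalize rows so each query still averages over the values.
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Aggregate value features with the calibrated attention matrix.
    return weights @ v, weights
```

With `p=2.0` the sketch reduces to ordinary attention aggregation (`attn @ v`), which makes it easy to check against a standard attention layer. In an actual adapter, the exponent and any surrounding projection layers would be set by the method's own definition rather than by this toy rule.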
