CVDec 2, 2025
Generalizing Vision-Language Models with Dedicated Prompt GuidanceXinyao Li, Yinjie Min, Hongbo Chen et al.
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
MEMay 2
Stable Localized Conformal Prediction via TransductionYinjie Min, Liuhua Peng, Changliang Zou
Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, prediction sets often exhibit high variability in size, especially for methods with localization. We formalize this concern as set stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Conformal Prediction (StCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of StCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods, especially for those with localization, when calibration data are limited.
LGFeb 5, 2025
LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-TuningZhekai Du, Yinjie Min, Jingjing Li et al.
Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the hypothesis space. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (iDCT) with selective locations of learnable components. We begin with a comprehensive theoretical comparison between frequency-domain and low-rank decompositions for fine-tuning pre-trained large models. Our analysis reveals that frequency-domain decomposition with carefully selected frequency components can surpass the expressivity of traditional low-rank-based methods. Furthermore, we demonstrate that iDCT offers a more efficient implementation compared to inverse Discrete Fourier Transform (iDFT), allowing for better selection and tuning of frequency components while maintaining equivalent expressivity to the optimal iDFT-based adaptation. By employing finite-difference approximation to estimate gradients for discrete locations of learnable coefficients on the DCT spectrum, LoCA dynamically selects the most informative frequency components during training. Experiments on diverse language and vision fine-tuning tasks demonstrate that LoCA offers enhanced parameter efficiency while maintains computational feasibility comparable to low-rank-based methods.