LG · AI · Nov 3, 2025

Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering

arXiv:2511.01694v1 · 1 citation · h-index: 2 · ICDM
Originality: Incremental advance
AI Analysis

This paper addresses efficient and robust fine-tuning of vision-language models when labeled data is scarce. It is an incremental advance that combines existing techniques in a novel way.

The paper tackles the challenge of few-shot fine-tuning for CLIP models on both in-distribution and out-of-distribution datasets by proposing a Bayesian approximation of Natural Gradient Descent using Kalman filtering. The method achieves superior or comparable in-distribution performance and improved out-of-distribution robustness compared to state-of-the-art baselines.

Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. For such models, achieving optimal performance on both in-distribution (ID) and out-of-distribution (OOD) datasets through few-shot fine-tuning is a major challenge, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods use local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, the natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), a step that is computationally expensive for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior or comparable ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.
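
The link between Kalman filtering and natural gradient descent mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm and does not involve CLIP; it assumes a linear model with Gaussian noise, and the names `kalman_step` and `fit_kalman`, the prior variance, and the synthetic data are all hypothetical choices made for illustration. In this setting each Kalman update equals a gradient step on the log-likelihood preconditioned by the posterior covariance `P`, whose inverse accumulates the Fisher information `x xᵀ / noise_var` one sample at a time, so the FIM is never formed or inverted explicitly.

```python
import numpy as np


def kalman_step(w, P, x, y, noise_var=1.0):
    """One Kalman update for a scalar observation y = x @ w + Gaussian noise."""
    residual = y - x @ w               # innovation (prediction error)
    s = x @ P @ x + noise_var          # innovation variance
    k = P @ x / s                      # Kalman gain; equals P_new @ x / noise_var
    w_new = w + k * residual           # covariance-preconditioned gradient step
    P_new = P - np.outer(k, x @ P)     # posterior covariance after this sample
    return w_new, P_new


def fit_kalman(X, Y, prior_var=10.0, noise_var=1.0):
    """Process samples one at a time, starting from a zero-mean Gaussian prior."""
    d = X.shape[1]
    w = np.zeros(d)
    P = prior_var * np.eye(d)
    for x, y in zip(X, Y):
        w, P = kalman_step(w, P, x, y, noise_var)
    return w, P


# Synthetic few-shot regression problem (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=20)

w_kf, P_kf = fit_kalman(X, Y, prior_var=10.0, noise_var=0.01)

# Batch reference: prior precision plus accumulated Fisher information.
fisher = np.eye(3) / 10.0 + X.T @ X / 0.01
w_batch = np.linalg.solve(fisher, X.T @ Y / 0.01)
```

In this linear-Gaussian case the sequential estimate `w_kf` matches the batch solution `w_batch`, and `P_kf` is the inverse of `fisher`; the covariance also serves as the uncertainty estimate. For a nonlinear model such as a neural network, `x` would be replaced by a Jacobian of the model output, and the correspondence holds only approximately.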
