Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic
This work addresses the challenge of understanding how multimodal systems learn. The contribution is incremental, as it builds on existing methods for analysis.
The paper tackles the problem of disentangling what visual captioning models learn during fine-tuning from their pre-trained knowledge. It uses Hybrid Markov Logic Networks (HMLNs) to quantify the influence of training examples on generated captions. On the MSCOCO dataset, it finds that BLIP2, which uses a large language model (LLM), shows a smaller fine-tuning influence than non-LLM models.
Multimodal systems have highly complex processing pipelines and are pretrained over large datasets before being fine-tuned for specific tasks such as visual captioning. However, it becomes hard to disentangle what the model learns during the fine-tuning process from what it already knows due to its pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) with visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), fine-tuning may have a smaller influence on the knowledge the model has acquired, since it may have more general knowledge to perform visual captioning as compared to models that do not use an LLM.
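To make the idea of a hybrid feature concrete, the following is a minimal toy sketch, not the paper's HMLN or its inference procedures. It assumes a single hand-written feature that multiplies a symbolic term (the number of predicates shared between two captions) by a continuous term (cosine similarity of visual feature vectors), and it normalizes the resulting log-linear scores over the training set. The data, the weight value, and the names `TRAIN`, `hybrid_log_score`, and `influence` are all hypothetical.

```python
import math

# Hypothetical toy data: each training example has symbolic predicates
# (standing in for knowledge extracted from its caption) and a small
# visual feature vector (standing in for features from its image).
TRAIN = [
    {"preds": {"dog", "frisbee"}, "vis": [0.9, 0.1, 0.0]},
    {"preds": {"dog", "sofa"}, "vis": [0.2, 0.8, 0.1]},
    {"preds": {"car", "street"}, "vis": [0.0, 0.1, 0.9]},
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_log_score(query, example, weight=2.0):
    """Weighted hybrid feature: symbolic overlap times visual similarity.

    The weight is an arbitrary assumption; in a learned model it would be
    estimated from the training data.
    """
    shared = len(query["preds"] & example["preds"])
    return weight * shared * cosine(query["vis"], example["vis"])


def influence(query, train):
    """Normalize exp(log-score) over the training examples."""
    scores = [hybrid_log_score(query, ex) for ex in train]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A generated caption and its image, in the same toy representation.
query = {"preds": {"dog", "frisbee"}, "vis": [0.8, 0.2, 0.0]}
weights = influence(query, TRAIN)
```

In this toy setup, the training example that shares the most predicates with the generated caption and looks most similar visually receives the largest share of the normalized score. An actual HMLN has many weighted formulas and requires approximate probabilistic inference, which this sketch does not attempt.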