LG | Feb 21, 2025

Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz

arXiv:2502.15678v2 | 3 citations | h-index: 13 | ICML

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enhancing human-like visual cognition in AI models. The advance is incremental: fine-tuning shows limited generalization beyond the fine-tuned tasks.

The researchers fine-tuned vision language models on intuitive physics and causal reasoning tasks to improve their visual cognition. Performance improved in the respective fine-tuning domain, and alignment with human behavior could also increase. However, these gains did not generalize robustly to data with other visual characteristics or to tasks in other cognitive domains.

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
