VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
This work addresses the challenge of equipping multimodal large language models (MLLMs) with domain expertise for cultural-heritage artifact analysis and provides a reusable resource for the AI and archaeology communities. The contribution is incremental, however, because it builds on existing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques.
The researchers tackled the problem of enabling multimodal large language models to perform robust, expert-level reasoning for ancient Greek pottery analysis by developing VaseVL, an SFT-then-RL system that uses a taxonomy of question types to identify and optimize performance gaps. Their approach achieved state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, and they released the VaseVQA benchmark with 31,773 images for future research.
Analyzing cultural-heritage artifacts remains challenging for MLLMs: general models lack domain expertise, and SFT often overfits superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards targeting those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, validating diagnosis-guided, taxonomy-conditioned reward engineering and providing a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.
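The diagnosis-then-reward loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the reward, so the error-rate weighting, the exact-match scoring, the `floor` parameter, the function names, and the question-type labels ("style", "attribution") are all assumptions chosen for illustration.

```python
from collections import defaultdict


def per_type_accuracy(records):
    """Probe step: accuracy of the SFT model per question type.

    `records` is an iterable of (question_type, is_correct) pairs,
    a hypothetical format for evaluation results.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += int(correct)
    return {t: hits[t] / totals[t] for t in totals}


def gap_weights(accuracy, floor=0.1):
    """Turn per-type error rates into normalized reward weights.

    Assumption: a type's weight is proportional to its error rate,
    with a floor so that well-solved types still receive some reward.
    """
    raw = {t: max(1.0 - acc, floor) for t, acc in accuracy.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def type_conditioned_reward(qtype, prediction, reference, weights):
    """Exact-match reward scaled by the weight of the question type.

    Unknown question types receive zero reward.
    """
    match = prediction.strip().lower() == reference.strip().lower()
    return weights.get(qtype, 0.0) * (1.0 if match else 0.0)


# Hypothetical evaluation results from probing an SFT model.
records = [
    ("style", True), ("style", True),
    ("attribution", False), ("attribution", True),
]
weights = gap_weights(per_type_accuracy(records))
reward = type_conditioned_reward(
    "attribution", "Berlin Painter", "berlin painter", weights
)
```

In this sketch the weaker question type receives the larger weight, so a correct answer there earns more reward than a correct answer on a type the SFT model already handles well. A compositionality-oriented reward of the kind the abstract mentions would need a richer scoring function than exact match.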