Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
This work addresses the challenge of cultural reasoning in AI for applications such as cultural heritage, but the contribution is incremental because it builds on existing vision-language models.
The paper tackles the underexplored problem of inferring structured cultural metadata from images and finds that current vision-language models show substantial performance variation across cultures and metadata types, leading to inconsistent predictions.
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
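To make the three reported metrics concrete, the following is a minimal sketch of how exact-match, partial-match, and attribute-level accuracy could be aggregated per cultural region once a judge has issued a per-attribute verdict. The abstract does not define these metrics, so the definitions here are assumptions: exact match means every attribute was judged correct, partial match means some but not all were, and attribute-level accuracy is the fraction of correct verdicts per attribute. The function name `summarize`, the record layout, the region labels, and the toy verdicts are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate judge verdicts into per-region metrics.

    Each record is a dict with a 'region' string and a 'verdicts' dict that
    maps an attribute name (e.g. 'creator') to True/False, where True means
    the judge considered the prediction aligned with the reference.
    Assumed definitions (not from the paper): exact = all attributes correct,
    partial = at least one but not all correct.
    """
    per_region = defaultdict(lambda: {
        "n": 0, "exact": 0, "partial": 0,
        "attr_hits": defaultdict(int), "attr_total": defaultdict(int),
    })
    for rec in records:
        stats = per_region[rec["region"]]
        verdicts = rec["verdicts"]
        hits = sum(1 for ok in verdicts.values() if ok)
        stats["n"] += 1
        if hits == len(verdicts):
            stats["exact"] += 1
        elif hits > 0:
            stats["partial"] += 1
        for attr, ok in verdicts.items():
            stats["attr_total"][attr] += 1
            stats["attr_hits"][attr] += int(ok)

    report = {}
    for region, s in per_region.items():
        report[region] = {
            "exact_match": s["exact"] / s["n"],
            "partial_match": s["partial"] / s["n"],
            "attribute_accuracy": {
                a: s["attr_hits"][a] / s["attr_total"][a]
                for a in s["attr_total"]
            },
        }
    return report


# Toy verdicts for illustration only; these are not results from the paper.
records = [
    {"region": "Region A", "verdicts": {"creator": True, "origin": True, "period": True}},
    {"region": "Region A", "verdicts": {"creator": False, "origin": True, "period": False}},
    {"region": "Region B", "verdicts": {"creator": False, "origin": False, "period": False}},
    {"region": "Region B", "verdicts": {"creator": True, "origin": True, "period": True}},
]
report = summarize(records)
```

Keeping exact and partial matches disjoint is a design choice in this sketch: the two rates then sum to the share of items with at least one correct attribute, which makes it easier to see where a model recovers only fragments of the metadata.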