CL · Jun 16, 2024

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

arXiv:2406.11030v2 · 30 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in cultural AI research by providing a dataset for fine-grained understanding of Chinese food culture, though its contribution is incremental, since it focuses on a single domain.

The authors address the lack of datasets capturing regional diversity in food culture by introducing FoodieQA, a fine-grained multimodal dataset for Chinese food. They find that while LLMs excel at text-based tasks, open-sourced VLMs lag well behind human accuracy on visual question answering: by 41% on multi-image tasks and 21% on single-image tasks.

Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

Code Implementations: 1 repo
