CV · Jan 8

How Does India Cook Biryani?

Shubham Goel, Farzana S, C V Rishi, Aditya Arun, C V Jawahar

Berkeley

arXiv:2601.06198v1 · 1.5 · h-index: 8

Originality: Synthesis-oriented

AI Analysis

This work provides a new testbed for evaluating vision-language models on structured, multimodal reasoning tasks, with potential applications in computational cultural heritage analysis. Its contribution is incremental, however, in that it applies existing methods to a new domain.

The authors study regional variations in biryani preparation using computational tools. They create a large-scale dataset of 120 videos across 12 styles and develop a multi-stage framework that uses vision-language models to segment, align, and compare procedural differences. From these components they build a new QA benchmark for evaluating VLMs on multimodal reasoning tasks.

Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to systematically study such culinary variations using computational tools. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework leveraging recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We also construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos. We release all data, code, and the project website at https://farzanashaju.github.io/how-does-india-cook-biryani/.
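The abstract describes two operations that a small sketch can make concrete: aligning video segments with transcript text, and comparing the procedures of two variants. The code below is an illustrative approximation only, not the authors' pipeline. It assumes segments and transcript lines carry start and end timestamps, aligns them by temporal overlap, and swaps in Python's standard `difflib.SequenceMatcher` for the paper's VLM-based comparison. The `Span` type, the function names, and the step labels in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Span:
    """Hypothetical container for a timed piece of content."""
    start: float  # seconds
    end: float    # seconds
    text: str     # step label or transcript sentence


def overlap(a: Span, b: Span) -> float:
    """Length in seconds of the temporal intersection of two spans."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align(segments, transcript):
    """Map each segment label to the transcript lines that overlap it in time."""
    return {
        seg.text: [line.text for line in transcript if overlap(seg, line) > 0]
        for seg in segments
    }


def diff_steps(steps_a, steps_b):
    """Return the non-matching edit operations between two step sequences."""
    matcher = SequenceMatcher(a=steps_a, b=steps_b, autojunk=False)
    return [
        (tag, steps_a[i1:i2], steps_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

As a usage example with made-up step labels, `diff_steps(["marinate meat", "fry onions", "layer"], ["fry onions", "cook meat in gravy", "layer"])` reports one step deleted from the first sequence and one inserted from the second. A plain sequence diff only finds exact label mismatches; explaining why two variants differ, as the abstract describes, would need the multimodal reasoning that this sketch leaves out.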
