CV, CL, MM · Aug 5, 2024

COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

arXiv:2408.02272v1 · 4 citations · h-index: 8
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for vision-language research in procedural video understanding, though the contribution is incremental because it targets a single domain (cooking videos).

The authors introduce COM Kitchens, a dataset of unedited overhead-view cooking videos captured with smartphones, to address the difficulty of querying instructional content from raw video. They propose two new tasks, Online Recipe Retrieval and Dense Video Captioning on unedited overhead-view videos, and their experiments show that current SOTA methods have limitations on both.

Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional content from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
