CL, CV, IR
Mar 5, 2015

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

arXiv:1503.01558v3
164 citations
AI Analysis

This work addresses the challenge of interpreting cooking videos for applications like recipe illustration and video search, but it is incremental as it builds on existing methods in a specific domain.

The paper tackles the problem of aligning recipe instructions to cooking videos by combining speech transcripts and visual food detection, achieving better performance than keyword spotting methods.

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
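The HMM alignment step described in the abstract can be illustrated with a small sketch. This is not the authors' model: it assumes a left-to-right HMM whose hidden states are recipe steps and whose observations are transcript segments, uses a simple word-overlap score in place of a learned emission model, and decodes with Viterbi. The names `tokenize`, `emission_score`, `align_steps`, and the `p_stay` parameter are hypothetical and chosen only for illustration.

```python
import math


def tokenize(text):
    """Lowercase a sentence and return its set of words, stripped of punctuation."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w}


def emission_score(step_words, segment_words, smoothing=0.1):
    """Assumed stand-in for an emission model: log of smoothed word overlap."""
    return math.log(len(step_words & segment_words) + smoothing)


def align_steps(steps, segments, p_stay=0.6):
    """Viterbi decoding over a left-to-right HMM.

    Each transcript segment is assigned one recipe step. Between consecutive
    segments the state either stays on the same step or advances by one,
    so the resulting alignment is monotonic. Returns one step index per segment.
    """
    step_words = [tokenize(s) for s in steps]
    seg_words = [tokenize(s) for s in segments]
    n_steps, n_segs = len(steps), len(segments)
    neg_inf = float("-inf")

    score = [[neg_inf] * n_steps for _ in range(n_segs)]
    back = [[0] * n_steps for _ in range(n_segs)]

    # Assumption: the video starts at the first recipe step.
    score[0][0] = emission_score(step_words[0], seg_words[0])

    for t in range(1, n_segs):
        for k in range(n_steps):
            stay = score[t - 1][k] + math.log(p_stay)
            advance = score[t - 1][k - 1] + math.log(1 - p_stay) if k > 0 else neg_inf
            best, prev = (stay, k) if stay >= advance else (advance, k - 1)
            if best > neg_inf:
                score[t][k] = best + emission_score(step_words[k], seg_words[t])
                back[t][k] = prev

    # Backtrack from the best-scoring final state.
    k = max(range(n_steps), key=lambda j: score[n_segs - 1][j])
    path = [k]
    for t in range(n_segs - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


if __name__ == "__main__":
    steps = [
        "Chop the onions",
        "Fry the onions in oil",
        "Add the tomatoes and simmer",
    ]
    segments = [
        "first we chop the onions finely",
        "now heat some oil and fry the onions",
        "keep frying the onions until golden",
        "then add the tomatoes",
        "and let it simmer",
    ]
    for segment, step_index in zip(segments, align_steps(steps, segments)):
        print(f"{segment!r} -> step {step_index + 1}: {steps[step_index]}")
```

A keyword-spotting baseline would score each segment against each step independently. The sketch differs in that the transition structure forces the step sequence to move forward through the recipe, which is the kind of ordering constraint an HMM contributes. The visual refinement stage is not modelled here.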

Code Implementations: 1 repo
