Multi-modal Cooking Workflow Construction for Food Recipes
The work addresses the challenge of automating recipe understanding for applications such as cooking assistants. The contribution is incremental: it builds on prior work by adding multi-modal data and neural methods.
The paper tackles the problem of constructing cooking workflow graphs from recipes. It introduces MM-ReS, a large-scale multi-modal dataset of 9,850 recipes with human-labeled workflow graphs, and proposes a neural encoder-decoder model that achieves a performance gain of over 20% over existing hand-crafted baselines.
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, so that the recipe can be converted into a graph describing its temporal workflow. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes, owing to the lack of large-scale labeled datasets. Moreover, they fail to utilize cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, achieving a performance gain of over 20% over existing hand-crafted baselines.
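To make the notion of a cooking workflow graph concrete, the following minimal sketch represents a recipe as a directed acyclic graph in which each step lists the steps it depends on. The recipe, the `depends_on` mapping, and the helper names `valid_order` and `parallel_groups` are illustrative assumptions; they are not taken from MM-ReS and do not reflect its annotation format or the proposed model. The sketch uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical recipe steps, invented for illustration (not from MM-ReS).
steps = {
    1: "Boil the pasta.",
    2: "Chop the garlic.",
    3: "Saute the garlic in olive oil.",
    4: "Toss the pasta with the garlic oil.",
}

# Assumed workflow graph: each step maps to the set of steps it depends on.
# Steps 1 and 2 are independent; step 4 needs the results of steps 1 and 3.
depends_on = {
    1: set(),
    2: set(),
    3: {2},
    4: {1, 3},
}


def valid_order(graph):
    """Return one ordering of steps that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def parallel_groups(graph):
    """Group steps into stages; steps within a stage can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        groups.append(ready)
        sorter.done(*ready)
    return groups


if __name__ == "__main__":
    for stage, group in enumerate(parallel_groups(depends_on), start=1):
        print(f"Stage {stage}:", [steps[s] for s in group])
```

In this toy graph, boiling the pasta and chopping the garlic have no dependency on each other, which is the kind of implicit structure that a linear list of instructions hides and a workflow graph makes explicit.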