Learning Program Representations for Food Images and Cooking Recipes
This work addresses the challenge of building interpretable, manipulable representations of instructional procedures such as cooking recipes, which benefits both users and AI agents. Its contribution is incremental in that it applies programmatic structures to one specific domain.
The paper models cooking recipes and food images as structured cooking programs. It reports three results: projecting image-recipe embeddings into programs improves cross-modal retrieval, generating programs from images improves recognition compared to predicting raw cooking instructions, and manipulating programs makes it possible to generate food images with a GAN.
In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online.
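To make the idea of a cooking program concrete, the following is a minimal sketch of a recipe stored as a graph of actions and flattened into a token sequence, the form in which the abstract says programs are generated. Everything here is an illustrative assumption: the names `Step`, `CookingProgram`, `swap_ingredient`, and the `h0 = action(...)` syntax are hypothetical and are not taken from the paper's released code or its annotation scheme.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One action node. All field names are hypothetical."""
    action: str
    ingredients: list[str] = field(default_factory=list)
    inputs: list[int] = field(default_factory=list)  # indices of earlier steps


@dataclass
class CookingProgram:
    """A recipe as a directed acyclic graph of steps (illustrative only)."""
    steps: list[Step] = field(default_factory=list)

    def add(self, action, ingredients=(), inputs=()):
        # A step may only consume outputs of steps that already exist,
        # which keeps the graph acyclic by construction.
        for i in inputs:
            if not 0 <= i < len(self.steps):
                raise ValueError(f"unknown input step {i}")
        self.steps.append(Step(action, list(ingredients), list(inputs)))
        return len(self.steps) - 1

    def edges(self):
        # (source, destination) pairs: the sequential relationships.
        return [(src, dst)
                for dst, step in enumerate(self.steps)
                for src in step.inputs]

    def to_sequence(self):
        # Flatten the graph into one line per step, in insertion order.
        lines = []
        for i, step in enumerate(self.steps):
            args = step.ingredients + [f"h{j}" for j in step.inputs]
            lines.append(f"h{i} = {step.action}({', '.join(args)})")
        return lines

    def swap_ingredient(self, old, new):
        # A simple user-side manipulation: substitute one ingredient.
        for step in self.steps:
            step.ingredients = [new if x == old else x
                                for x in step.ingredients]


program = CookingProgram()
boil = program.add("boil", ["water", "pasta"])
saute = program.add("saute", ["garlic", "olive oil"])
mix = program.add("mix", ["salt"], inputs=[boil, saute])

for line in program.to_sequence():
    print(line)
```

Editing the structure, for example with `program.swap_ingredient("pasta", "rice")`, changes the flattened sequence accordingly. This is the kind of manipulation that a structured representation makes straightforward, in contrast to editing free-form instruction text.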