CV · Jul 29, 2025

Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance

arXiv:2507.21529v1 · 2 citations · h-index: 5 · MM
Originality: Incremental advance
AI Analysis

This work addresses the specific problem of visualizing cooking steps for applications in food analysis and image generation, representing an incremental improvement over prior methods focused on finished foods.

The paper tackles the problem of cooking process visualization by generating images for each step of a recipe, addressing challenges of semantic inconsistency and contextual coherence. The proposed Chain-of-Cooking model, with modules like Dynamic Patch Selection and Bidirectional Chain-of-Thought Guidance, outperforms existing methods in generating coherent and semantically consistent images, as shown in quantitative and qualitative experiments.

Cooking process visualization is a promising task at the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods from the given recipes, and face two challenges in visualizing the cooking process. First, because the appearance of ingredients varies across cooking steps, it is difficult to generate food appearances that match the textual description, which leads to semantic inconsistency. Second, because the current step may depend on the operations of the previous step, it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module that retrieves the previously generated image patches most related to the current textual content as references. Furthermore, to enhance coherence and keep the generated images in a rational order, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantic association between latent prompts and the current cooking step, and merges it with the latent features. The CoT Guidance then updates the merged features to guide the current cooking step to remain coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantically consistent cooking processes.
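The abstract does not say how the Dynamic Patch Selection Module scores patches, so the following is only a minimal sketch of the general idea of retrieving the previously generated patches most related to the current step text. It assumes that patches and text are already embedded in a shared vector space and that relatedness is measured by cosine similarity. Both are assumptions, and the names `cosine_similarity`, `select_reference_patches`, and `top_k` are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_reference_patches(
    text_embedding: Sequence[float],
    patch_embeddings: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (patch_index, similarity) for the top_k patches closest to the text.

    Assumption: similarity in a shared embedding space stands in for
    "most related to current textual contents"; the paper may differ.
    """
    scored = [
        (index, cosine_similarity(text_embedding, patch))
        for index, patch in enumerate(patch_embeddings)
    ]
    # Highest similarity first; the sort is stable, so ties keep earlier patches first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (made-up values, for illustration only).
step_text = [1.0, 0.0, 0.0]
previous_patches = [
    [0.0, 1.0, 0.0],  # orthogonal to the step text
    [1.0, 0.0, 0.0],  # same direction as the step text
    [1.0, 1.0, 0.0],  # partially aligned with the step text
]
selected = select_reference_patches(step_text, previous_patches, top_k=2)
selected_indices = [index for index, _ in selected]
print(selected_indices)
```

In this toy setup the patch aligned with the text ranks first and the partially aligned patch second. A real system would obtain the embeddings from learned image and text encoders rather than hand-written vectors.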

