Food Image Generation on Multi-Noun Categories
This work addresses a specific challenge in food image generation for real-world datasets such as UEC-256, though the contribution is incremental because it builds on existing generative models.
The paper tackles the problem of generating realistic food images for multi-noun categories, where models often misinterpret the semantics and produce incorrect ingredients or layouts. It proposes FoCULR, which incorporates food domain knowledge and refines layouts, and reports improved image generation performance.
Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. The failures stem from two causes: the text encoder has insufficient knowledge of multi-noun categories, and the relationships between the nouns are misinterpreted, which leads to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that integrating these techniques improves image generation performance in the food domain.
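The abstract does not spell out how core concepts are introduced early, so the following is only a minimal sketch of one possible reading: condition the first few denoising steps on a single core noun and the remaining steps on the full category name. The helpers `core_concept` and `prompt_schedule`, the `early_steps` parameter, and the last-word heuristic are all hypothetical illustrations and are not taken from FoCULR.

```python
from typing import List


def core_concept(category: str) -> str:
    """Pick a core noun for a category name.

    Assumption: the last word is the head noun ("egg noodle" -> "noodle").
    This heuristic is illustrative only; it is not the paper's method.
    """
    words = category.strip().split()
    if not words:
        raise ValueError("category name must not be empty")
    return words[-1]


def prompt_schedule(category: str, num_steps: int, early_steps: int) -> List[str]:
    """Return the text prompt to condition on at each denoising step.

    The first `early_steps` steps use only the core concept, so the coarse
    layout is formed around a single dish; later steps use the full name.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0 <= early_steps <= num_steps:
        raise ValueError("early_steps must lie in [0, num_steps]")
    core = core_concept(category)
    return [core if step < early_steps else category for step in range(num_steps)]


if __name__ == "__main__":
    # Print the per-step prompts for a short, 5-step schedule.
    for step, prompt in enumerate(prompt_schedule("egg noodle", 5, 2)):
        print(step, prompt)
```

In a real diffusion pipeline, each scheduled prompt would be encoded by the text encoder and passed as conditioning at the corresponding step; that integration is omitted here because it depends on the specific model.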