CV | Dec 22, 2023

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

arXiv:2312.14991v2 | 50 citations | h-index: 58 | Has Code | IEEE Transactions on Multimedia
Originality: Incremental advance
AI Analysis

This work addresses the need for specialized AI assistants in the food domain, offering an incremental improvement over existing models.

The paper addresses the weak performance of general-purpose large multi-modal models in the food domain by proposing FoodLMM, a versatile assistant that achieves state-of-the-art results on several food benchmarks across tasks such as recognition, recipe generation, and segmentation.

Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To let FoodLMM handle tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruction-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.
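
The idea of task-specific tokens and heads can be illustrated with a minimal sketch. This is not the authors' implementation: the token name `<nutrition>`, the 4-dimensional hidden states, and the weights are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern, where the model emits a special token and a small head reads the hidden state at that position to produce a non-text output (here, a single scalar).

```python
import random

# Hypothetical special token; the abstract does not name the actual tokens.
NUTRITION_TOKEN = "<nutrition>"


def find_token_positions(tokens, special):
    """Return the indices at which the special token occurs in the sequence."""
    return [i for i, tok in enumerate(tokens) if tok == special]


def linear_head(hidden, weights, bias):
    """Map one hidden-state vector to a scalar (a one-output linear layer)."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def predict_from_task_tokens(tokens, hidden_states, weights, bias):
    """Apply the regression head to the hidden state at each task-token position."""
    positions = find_token_positions(tokens, NUTRITION_TOKEN)
    return [linear_head(hidden_states[i], weights, bias) for i in positions]


# Toy stand-ins for model output: random 4-dimensional hidden states.
rng = random.Random(0)
tokens = ["The", "dish", "has", NUTRITION_TOKEN, "kcal", "."]
hidden_states = [[rng.uniform(-1, 1) for _ in range(4)] for _ in tokens]
weights = [0.5, -0.25, 0.1, 0.0]  # placeholder values, not learned parameters
bias = 1.0

values = predict_from_task_tokens(tokens, hidden_states, weights, bias)
print(len(values))  # one prediction per task token in the sequence
```

In a real system the head would presumably be trained jointly with the language model, and a segmentation head would consume the token's hidden state to decode a mask rather than a scalar; those details are outside what the abstract specifies.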

