CV · Jun 3

Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

arXiv:2606.04986 · 87.3
Predicted impact: top 19% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in food computing, this work provides a new benchmark and a method that improves reasoning and generalization over supervised fine-tuning, though it is domain-specific and incremental.

Food-R1 introduces CalorieBench-80K, the first food benchmark with chain-of-thought calorie reasoning, and a multi-task VLM trained with reinforcement fine-tuning (GRPO) that outperforms baselines across food tasks.

Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
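As a rough illustration of the reinforcement fine-tuning step named in the abstract, the sketch below shows the group-relative advantage at the core of GRPO: several answers are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The reward function `calorie_reward`, the sample predictions, and the 500 kcal label are hypothetical and chosen only for illustration; the abstract does not state which rewards Food-R1 uses.

```python
from statistics import mean, pstdev


def calorie_reward(predicted_kcal, true_kcal):
    """Hypothetical reward (an assumption, not from the paper):
    1.0 for an exact match, falling linearly to 0.0 at 100% relative error."""
    rel_err = abs(predicted_kcal - true_kcal) / true_kcal
    return max(0.0, 1.0 - rel_err)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation
    of the group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled calorie estimates for one image labeled 500 kcal (made-up values).
predictions = [450.0, 500.0, 700.0, 1200.0]
rewards = [calorie_reward(p, 500.0) for p in predictions]
advantages = group_relative_advantages(rewards)

for p, r, a in zip(predictions, rewards, advantages):
    print(f"prediction={p:7.1f} kcal  reward={r:.2f}  advantage={a:+.2f}")
```

Responses that score above the group mean receive a positive advantage and are reinforced, while those below it are penalized, so no separate value network is needed to form the baseline.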

