CV · Sep 23, 2025

Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

arXiv:2509.19028v2 · 1 citation · h-index: 18 · SMAP
Originality: Incremental advance
AI Analysis

This incremental work addresses food image annotation and nutrition tracking applications by reducing the need for pixel-level labels.

The paper tackles weakly supervised semantic segmentation for food images by combining Vision Transformers and the Segment Anything Model. On the FoodSeg103 dataset, it generates an average of 2.4 masks per image (excluding background) and achieves an mIoU of 0.54 in the multi-mask scenario.
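For readers unfamiliar with the metric, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score can be computed from a predicted and a ground-truth label map. The function name `mean_iou` and the `ignore_background` flag are illustrative assumptions. The paper's exact evaluation protocol (for example, whether scores are accumulated per image or over the whole dataset, and how background is treated) is not stated in this summary.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_background=False):
    """Average per-class IoU between two integer label maps.

    Hypothetical helper for illustration; class 0 is assumed to be background.
    Classes absent from both maps are skipped rather than counted as 0 or 1.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    first_class = 1 if ignore_background else 0
    ious = []
    for c in range(first_class, num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class does not occur in either map
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 example: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 1]]
score = mean_iou(pred, target, num_classes=2)
```

A score of 0.54 on this scale would mean that, averaged over classes, predicted and reference regions overlap on slightly more than half of their combined area.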

In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
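The step of turning a class activation map into SAM prompts can be sketched as follows. This is an illustrative assumption, not the authors' procedure: the abstract does not say how prompts are derived from the CAMs, so the thresholding rule, the function name `cam_to_prompts`, and the parameters `rel_threshold` and `max_points` are all hypothetical. The output follows the conventions of SAM's reference predictor, which accepts point coordinates in (x, y) order with label 1 for foreground, and boxes in XYXY format.

```python
import numpy as np

def cam_to_prompts(cam, rel_threshold=0.5, max_points=3):
    """Derive point and box prompts from a 2-D class activation map.

    Hypothetical helper: keeps pixels whose min-max normalised activation is
    at least rel_threshold, returns the strongest ones as foreground points
    and the bounding box of the kept region.
    """
    cam = np.asarray(cam, dtype=np.float64)
    value_range = cam.max() - cam.min()
    if value_range == 0:
        return None  # flat map: no evidence for this class
    norm = (cam - cam.min()) / value_range
    ys, xs = np.nonzero(norm >= rel_threshold)
    # Sort the kept pixels by activation, strongest first.
    order = np.argsort(norm[ys, xs])[::-1][:max_points]
    point_coords = np.stack([xs[order], ys[order]], axis=1)  # (x, y) order
    point_labels = np.ones(len(order), dtype=int)            # 1 = foreground
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY
    return point_coords, point_labels, box

# Toy CAM with two strong activations in column 3.
cam = np.zeros((4, 5))
cam[1, 3] = 1.0
cam[2, 3] = 0.8
cam[0, 0] = 0.2
point_coords, point_labels, box = cam_to_prompts(cam)
```

In SAM's reference implementation, arrays of this shape could be passed to `SamPredictor.predict` as `point_coords`, `point_labels`, and `box`, with `multimask_output` switching between a single mask and several candidate masks per prompt. Which of these prompt types the paper actually uses is not specified in the abstract.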

