CV · May 26, 2025

LlamaSeg: Image Segmentation via Autoregressive Mask Generation

Jiru Deng, Tengjin Weng, Tianyu Yang, Wenhan Luo, Zhiheng Li, Wenhao Jiang

arXiv:2505.19422v1 · h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses flexible, open-vocabulary image segmentation. It is an incremental advance that integrates segmentation into autoregressive architectures through next-token prediction.

The authors unify multiple image segmentation tasks by reformulating segmentation as a visual generation problem. The resulting model is reported to surpass existing generative models across multiple datasets and to produce more detailed masks.

We present LlamaSeg, a visual autoregressive framework that unifies multiple image segmentation tasks via natural language instructions. We reformulate image segmentation as a visual generation problem, representing masks as "visual" tokens and employing a LLaMA-style Transformer to predict them directly from image inputs. By adhering to the next-token prediction paradigm, our approach naturally integrates segmentation tasks into autoregressive architectures. To support large-scale training, we introduce a data annotation pipeline and construct the SA-OVRS dataset, which contains 2M segmentation masks annotated with over 5,800 open-vocabulary labels or diverse textual descriptions, covering a wide spectrum of real-world scenarios. This enables our model to localize objects in images based on text prompts and to generate fine-grained masks. To more accurately evaluate the quality of masks produced by visual generative models, we further propose a composite metric that combines Intersection over Union (IoU) with Average Hausdorff Distance (AHD), offering a more precise assessment of contour fidelity. Experimental results demonstrate that our method surpasses existing generative models across multiple datasets and yields more detailed segmentation masks.
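The abstract names the two ingredients of the proposed metric, IoU and Average Hausdorff Distance (AHD), but does not give the formula that combines them. The following is a minimal, dependency-free sketch of the two components only, not the authors' implementation. It assumes binary masks stored as nested lists, takes AHD over boundary pixels (a foreground pixel with at least one 4-connected neighbor that is background or outside the image), and uses the symmetric definition that averages the two directed mean distances. All function names are hypothetical.

```python
from math import hypot

def mask_points(mask):
    """Return the set of (row, col) coordinates of foreground pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}

def iou(pred, target):
    """Intersection over Union of two binary masks."""
    p, t = mask_points(pred), mask_points(target)
    union = p | t
    if not union:
        return 1.0  # convention (assumption): two empty masks agree perfectly
    return len(p & t) / len(union)

def boundary_points(mask):
    """Foreground pixels with a 4-neighbor that is background or out of bounds."""
    pts = mask_points(mask)
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c) for (r, c) in pts
            if any((r + dr, c + dc) not in pts for dr, dc in steps)}

def directed_mean_distance(src, dst):
    """Mean over src of the Euclidean distance to the nearest point in dst."""
    total = sum(min(hypot(r1 - r2, c1 - c2) for (r2, c2) in dst)
                for (r1, c1) in src)
    return total / len(src)

def average_hausdorff(pred, target):
    """Symmetric Average Hausdorff Distance between mask boundaries, in pixels."""
    a, b = boundary_points(pred), boundary_points(target)
    if not a or not b:
        # convention (assumption): undefined distance is 0 if both are empty
        return 0.0 if a == b else float("inf")
    return 0.5 * (directed_mean_distance(a, b) + directed_mean_distance(b, a))

# Toy example: a 2x2 square and the same square shifted one pixel to the right.
target = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 0]]

print(iou(pred, target))                # 2 shared pixels / 6 in the union = 0.333...
print(average_hausdorff(pred, target))  # 0.5
```

IoU measures region overlap, while AHD measures how far the predicted contour lies from the reference contour, which is why a metric using both can separate masks with similar overlap but different boundary quality. The brute-force nearest-point search is quadratic in the number of boundary pixels; a distance transform would be the usual choice for full-resolution masks.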

