CV · Jan 31, 2024

Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao

arXiv:2401.17904v220.942 citations · h-index: 43 · Has Code · IEEE Trans Pattern Anal Mach Intell

Originality: Incremental advance

AI Analysis

This work addresses the problem of hierarchical text segmentation and layout analysis for document analysis and computer vision applications, representing an incremental advancement by adapting a foundation model to a specific domain.

The paper tackles hierarchical text segmentation by introducing Hi-SAM, a model that adapts the Segment Anything Model to segment text at the pixel, word, text-line, and paragraph levels. Its pixel-level text segmentation model reaches state-of-the-art results of 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg. On HierText, Hi-SAM improves over the previous specialist model for joint hierarchical detection and layout analysis by 4.73% PQ and 5.39% F1 at the text-line level and by 5.49% PQ and 7.39% F1 at the paragraph level.

The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four hierarchies, including pixel-level text, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate the pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation (PS) mode. In the AMG mode, Hi-SAM segments pixel-level text foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the PS mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
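The abstract mentions two ideas that a small example can make concrete: the fgIOU metric and the AMG step that samples point prompts from a predicted text foreground mask. The sketch below is illustrative only and is not taken from the Hi-SAM repository. It assumes fgIOU means intersection-over-union restricted to foreground pixels, computed here on a single image (benchmarks may accumulate counts over a whole dataset), and it assumes uniform random sampling of foreground pixels. The names `fg_iou` and `sample_foreground_points` are hypothetical.

```python
import random
from typing import List, Tuple

# A binary mask as nested lists: 1 = text foreground, 0 = background.
Mask = List[List[int]]


def fg_iou(pred: Mask, target: Mask) -> float:
    """Foreground IoU for one image: |pred AND target| / |pred OR target|."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def sample_foreground_points(
    mask: Mask, n: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Draw up to n distinct (x, y) point prompts from foreground pixels."""
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(foreground, min(n, len(foreground)))


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0]]
    target = [[1, 0, 0], [0, 1, 1]]
    print(fg_iou(pred, target))
    print(sample_foreground_points(pred, n=2))
```

In an AMG-style pipeline, each sampled point would then be passed to the mask decoder as a prompt. How Hi-SAM actually selects and batches its points is described in the paper and the linked code, not here.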
