IR · Sep 9, 2016

Extraction of Layout Entities and Sub-layout Query-based Retrieval of Document Images

Anukriti Bansal, Sumantra Dutta Roy, Gaurav Harit

arXiv:1609.02687v1 · 2.7

Originality: Incremental advance

AI Analysis

This work addresses document retrieval when textual content is unavailable or irrelevant, offering a practical solution for domains such as archival or newspaper management, though it appears to be an incremental improvement on existing layout-based methods.

The paper tackles the problem of retrieving document images based on structural layout and sub-layout queries, proposing a graph-based matching algorithm with hash-based indexing to efficiently search large databases, and reports promising results on a dataset of 4776 newspaper images.

Layouts and sub-layouts constitute an important clue when searching for a document on the basis of its structure, or when textual content is unknown or irrelevant. A sub-layout specifies the arrangement of document entities within a smaller portion of the document. We propose an efficient graph-based matching algorithm, integrated with hash-based indexing, to prune a possibly large search space. A user can specify a combination of sub-layouts of interest using sketch-based queries. The system supports partial matching for unspecified layout entities. We handle segmentation pre-processing errors (for text/non-text blocks) with a symmetry maximization-based strategy, and by accounting for multiple domain-specific plausible segmentation hypotheses. We show promising results of our system on a database of unstructured entities containing 4776 newspaper images.
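To make the idea of hash-based pruning followed by graph matching more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the block representation, the coarse `relation` scheme, and all function names (`edge_keys`, `build_index`, `matches`, `search`) are assumptions chosen for illustration, and it omits the paper's partial matching and segmentation-error handling. Each layout is treated as a graph whose edges are labeled spatial relations; edges are hashed into an inverted index, and only documents containing every query edge are checked by a brute-force matcher.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical representation: a block is (label, x0, y0, x1, y1) with the
# y axis pointing down; a layout is a list of blocks.

def relation(a, b):
    """Coarse spatial relation from block a to block b (illustrative choice)."""
    if a[3] <= b[1]:
        return "left_of"
    if b[3] <= a[1]:
        return "right_of"
    if a[4] <= b[2]:
        return "above"
    if b[4] <= a[2]:
        return "below"
    return "overlaps"

def edge_keys(layout):
    """Hashable keys (label_a, relation, label_b) for every ordered block pair."""
    return {(a[0], relation(a, b), b[0])
            for i, a in enumerate(layout)
            for j, b in enumerate(layout) if i != j}

def build_index(database):
    """Inverted index: edge key -> set of document ids containing that edge."""
    index = defaultdict(set)
    for doc_id, layout in database.items():
        for key in edge_keys(layout):
            index[key].add(doc_id)
    return index

def matches(query, layout):
    """Brute-force check for an injective mapping of query blocks onto layout
    blocks that preserves labels and pairwise relations."""
    n = len(query)
    for cand in permutations(layout, n):
        labels_ok = all(q[0] == c[0] for q, c in zip(query, cand))
        if labels_ok and all(
                relation(query[i], query[j]) == relation(cand[i], cand[j])
                for i in range(n) for j in range(n) if i != j):
            return True
    return False

def search(query, database, index):
    """Prune with the hash index, then verify the surviving candidates."""
    keys = edge_keys(query)
    if keys:
        candidates = set.intersection(*(index.get(k, set()) for k in keys))
    else:
        candidates = set(database)  # single-block query: nothing to prune on
    return sorted(d for d in candidates if matches(query, database[d]))

# Toy data (made up for this example).
database = {
    "page_a": [("image", 0, 0, 40, 30), ("text", 50, 0, 100, 30),
               ("text", 0, 40, 100, 90)],
    "page_b": [("text", 0, 0, 40, 30), ("image", 50, 0, 100, 30)],
    "page_c": [("text", 0, 0, 100, 20), ("text", 0, 30, 100, 90)],
}
index = build_index(database)

# Sub-layout query: an image with a text block to its right.
query = [("image", 0, 0, 10, 10), ("text", 20, 0, 30, 10)]
print(search(query, database, index))  # ['page_a']
```

In this sketch the index lookup discards `page_b` and `page_c` before any matching is attempted, which is the role the abstract assigns to hashing. The exhaustive `matches` step is only practical for small queries; a real system would need a more efficient matcher.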
