IR | Feb 20, 2020

Processing topical queries on images of historical newspaper pages

arXiv:2002.08500v1 | 1.6

Originality: Synthesis-oriented

AI Analysis

This work aims to help researchers in the human and social sciences access and analyze historical newspaper images. It appears incremental, since it builds on existing segmentation and topic extraction methods.

The paper tackles the challenge of machine-reading historical newspaper images by developing a topic navigation system with four modules for text extraction and topic modeling, and presents initial test results on a 28-year collection.

Historical newspapers are a research source for the human and social sciences. However, these image collections are difficult to read by machine because of low print quality, the lack of page standardization, and the poor photographic quality of some files. This paper presents the processing model of a topic navigation system for historical newspaper page images. The general procedure consists of four modules: (1) segmentation of text sub-images and text extraction, (2) preprocessing and representation, (3) induced topic extraction and representation, and (4) a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection spanning 28 years are presented.
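
To make the data flow between the four modules concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not name specific algorithms, so TF-IDF weighting stands in for the representation step, top-weighted terms stand in for induced topics, and cosine similarity stands in for retrieval. The output of module 1 (segmentation and text extraction) is assumed to be already available as plain text per segment. All function names, the stopword list, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a language-specific one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "on", "for", "is"}


def preprocess(text):
    """Module 2 stand-in: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs_tokens):
    """Module 2 stand-in: map each segment id to a sparse TF-IDF dict."""
    n = len(docs_tokens)
    # Document frequency: number of segments in which each term occurs.
    df = Counter(t for tokens in docs_tokens.values() for t in set(tokens))
    vectors = {}
    for doc_id, tokens in docs_tokens.items():
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors[doc_id] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()
        }
    return vectors


def top_terms(vector, k=3):
    """Module 3 stand-in: the k highest-weighted terms act as a crude topic label."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, vectors):
    """Module 4 stand-in: rank segments by similarity to a topical query."""
    q = Counter(preprocess(query))
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in vectors.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented sample texts, standing in for the OCR output of module 1.
segments = {
    "p1": "The coffee harvest in the province exceeded the harvest of last year",
    "p2": "Election results: the governor won the election in the capital",
    "p3": "Coffee prices rise as exports to Europe grow",
}
vectors = tfidf_vectors({doc_id: preprocess(text) for doc_id, text in segments.items()})
results = search("coffee harvest", vectors)
```

With these sample texts, `results` ranks `p1` ahead of `p3` and omits `p2`, which shares no term with the query. A real system would replace each stand-in with the corresponding module's actual method, for example a proper topic model in place of `top_terms`.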
