CV · IR · Dec 3, 2016

Mining Spatio-temporal Data on Industrialization from Historical Registries

arXiv:1612.00992v1 · 16 citations
Originality: Incremental advance
AI Analysis

The paper provides a scalable method for historians and environmental researchers to analyze socioenvironmental phenomena. The advance is incremental because it builds on existing OCR and layout analysis techniques.

The researchers extract structured historical data from printed directories with a data-mining pipeline that integrates page layout analysis and OCR. Applied to manufacturing registries from Rhode Island, the pipeline yields geocoded spatio-temporal data on industrial land use spanning 41 years. These data show evidence of manufacturing dispersing from Providence along the Interstate 95 corridor.

Despite the growing availability of big data in many fields, historical data on socioenvironmental phenomena are often not available due to a lack of automated and scalable approaches for collecting, digitizing, and assembling them. We have developed a data-mining method for extracting tabulated, geocoded data from printed directories. While scanning and optical character recognition (OCR) can digitize printed text, these methods alone do not capture the structure of the underlying data. Our pipeline integrates both page layout analysis and OCR to extract tabular, geocoded data from structured text. We demonstrate the utility of this method by applying it to scanned manufacturing registries from Rhode Island that record 41 years of industrial land use. The resulting spatio-temporal data can be used for socioenvironmental analyses of industrialization at a resolution that was not previously possible. In particular, we find strong evidence for the dispersion of manufacturing from the urban core of Providence, the state's capital, along the Interstate 95 corridor to the north and south.
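The abstract's point that OCR text alone lacks structure can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a layout-analysis step has already labeled each OCR line as a "heading" (taken here to be a city name) or an "entry", and that entries look like "Company, street address". The `Record` type, the `ENTRY_RE` pattern, the `assemble_records` function, and the sample lines are all hypothetical, and geocoding of the extracted addresses is left out.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Record:
    """One tabular row assembled from a directory page (hypothetical schema)."""
    year: int
    city: str
    company: str
    address: str


# Assumed entry format: "Company Name, 123 Street Name".
# Real registries will need a pattern tuned to their own typography.
ENTRY_RE = re.compile(r"^(?P<company>[^,]+),\s*(?P<address>\d+\s+.+)$")


def assemble_records(year: int, lines: Iterable[Tuple[str, str]]) -> List[Record]:
    """Turn (layout_label, ocr_text) pairs into rows.

    The layout label supplies the structure that raw OCR text lacks:
    each "heading" sets the city for the "entry" lines that follow it.
    """
    records: List[Record] = []
    city: Optional[str] = None
    for label, text in lines:
        text = text.strip()
        if label == "heading":
            city = text.title()
        elif label == "entry" and city is not None:
            match = ENTRY_RE.match(text)
            if match:  # skip lines that OCR garbled beyond the pattern
                records.append(
                    Record(
                        year=year,
                        city=city,
                        company=match.group("company").strip(),
                        address=match.group("address").strip(),
                    )
                )
    return records


# Made-up sample page, for illustration only.
page = [
    ("heading", "PROVIDENCE"),
    ("entry", "Acme Tool Co., 12 Main St"),
    ("entry", "garbled line"),
    ("heading", "WARWICK"),
    ("entry", "Bay Textile Mills, 400 Post Rd"),
]
rows = assemble_records(1975, page)
```

Each resulting row carries a year, city, and street address, which is the form a geocoder would need to place the entry on a map.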

Code Implementations: 1 repo