DocRevive: A Unified Pipeline for Document Text Restoration
This work addresses the underexplored problem of document text restoration for archival research and digital preservation, but the approach is incremental: it combines existing techniques (OCR, masked language modeling, and diffusion models) rather than introducing a fundamentally new paradigm.
DocRevive introduces a unified pipeline combining OCR, image analysis, masked language modeling, and diffusion models to restore damaged or occluded text in documents, achieving semantically coherent reconstruction while preserving visual integrity. The pipeline is evaluated on a synthetic dataset of 30,078 degraded images, and a new Unified Context Similarity Metric (UCSM) is proposed for evaluation.
In document understanding, reconstructing damaged, occluded, or incomplete text remains a critical yet largely unexplored problem, and downstream document understanding tasks can benefit from a document reconstruction step. In response, this paper presents a unified pipeline that combines state-of-the-art Optical Character Recognition (OCR), advanced image analysis, masked language modeling, and diffusion-based models to restore and reconstruct text while preserving visual integrity. We create a synthetic dataset of 30,078 degraded document images that simulates diverse degradation scenarios and serves as a benchmark for restoration tasks. Our pipeline detects and recognizes text, identifies degraded regions with an occlusion detector, and uses an inpainting model for semantically coherent reconstruction. A diffusion-based module then reintegrates the restored text into the image, matching font, size, and alignment. To evaluate restoration quality, we propose a Unified Context Similarity Metric (UCSM), which combines edit, semantic, and length similarities with a contextual predictability measure that penalizes deviations when the correct text is contextually obvious. Our work advances document restoration, benefiting archival research and digital preservation while setting a new standard for text reconstruction. The OPRB dataset and code are available on Hugging Face (https://huggingface.co/datasets/kpurkayastha/OPRB) and GitHub (https://github.com/kunalpurkayastha/DocRevive), respectively.
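The abstract names the ingredients of UCSM but not its formula, so the following is only a minimal sketch of how such a metric could be assembled, not the authors' definition. Everything specific in it is an assumption: the function names (`ucsm_sketch`, `bow_cosine`), the equal weights, the bag-of-words cosine used as a stand-in for a real semantic similarity model, and the penalty form `base * (1 - p * (1 - base))`. The `predictability` argument is assumed to be supplied by the caller, for example as a masked language model's probability for the reference text.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def length_similarity(pred: str, ref: str) -> float:
    """Ratio of the shorter length to the longer length."""
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else min(len(pred), len(ref)) / longest


def bow_cosine(pred: str, ref: str) -> float:
    """ASSUMPTION: bag-of-words cosine as a stand-in for semantic similarity."""
    a, b = Counter(pred.lower().split()), Counter(ref.lower().split())
    if not a or not b:
        return 1.0 if a == b else 0.0
    dot = sum(count * b[token] for token, count in a.items())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return min(1.0, dot / norm)


def ucsm_sketch(pred: str, ref: str, predictability: float,
                weights=(1.0, 1.0, 1.0)) -> float:
    """Hypothetical UCSM-style score in [0, 1].

    predictability: value in [0, 1] stating how obvious the reference is
    from context (assumed to come from an external model).
    """
    if not 0.0 <= predictability <= 1.0:
        raise ValueError("predictability must lie in [0, 1]")
    w_edit, w_sem, w_len = weights
    total = w_edit + w_sem + w_len
    base = (w_edit * edit_similarity(pred, ref)
            + w_sem * bow_cosine(pred, ref)
            + w_len * length_similarity(pred, ref)) / total
    # ASSUMED penalty: the more obvious the reference, the harder a miss is hit.
    return base * (1.0 - predictability * (1.0 - base))


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, round(ucsm_sketch("the cat sat", "the cut sat", p), 3))
```

With this penalty form, an exact match scores 1 regardless of predictability, a prediction made in an unpredictable context keeps its base similarity, and the same error is scored lower as the context makes the correct text more obvious, which mirrors the behavior described in the abstract.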