CV · Aug 4, 2025

Generating Synthetic Invoices via Layout-Preserving Content Replacement

arXiv:2508.03754v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses data scarcity for researchers and practitioners in document intelligence, though the advance is incremental because it builds on existing OCR and LLM techniques.

The paper tackles the problem of limited datasets for automated invoice processing by introducing a pipeline that generates synthetic invoices with realistic content and preserved layouts, enabling the creation of large, varied corpora for training more robust document intelligence models.

The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.
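The data side of the pipeline described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `OcrField` record, the `fake_llm_value` generator (a seeded stand-in for the LLM call), the field labels, and the `inpaint` / `draw_text` operation names are all hypothetical. The OCR step is assumed to have already produced labeled text regions with pixel boxes, and the actual image editing is left to whatever inpainting and text-rendering tools are used downstream.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class OcrField:
    """One OCR-detected text region (hypothetical record layout)."""
    label: str   # semantic field name, e.g. "vendor" or "total"
    text: str    # recognized text
    box: tuple   # (x, y, width, height) in pixels


def fake_llm_value(label, rng):
    """Stand-in for the LLM call: return plausible text for a field label."""
    if label == "invoice_number":
        return f"INV-{rng.randint(10000, 99999)}"
    if label == "total":
        return f"{rng.uniform(10, 5000):.2f}"
    if label == "vendor":
        return rng.choice(["Northwind Supply Co.", "Acme Tools Ltd.", "Blue Harbor LLC"])
    raise ValueError(f"no generator for label {label!r}")


def synthesize(fields, replace_labels, seed=0):
    """Replace selected fields and keep every bounding box unchanged.

    Returns the new field list, a list of image-edit operations to be
    executed by a renderer, and the aligned JSON annotation.
    """
    rng = random.Random(seed)
    new_fields, render_ops = [], []
    for field in fields:
        if field.label not in replace_labels:
            new_fields.append(field)  # untouched fields pass through as-is
            continue
        new_text = fake_llm_value(field.label, rng)
        # Erase the original text region, then draw the new text in the same box.
        render_ops.append({"op": "inpaint", "box": list(field.box)})
        render_ops.append({"op": "draw_text", "box": list(field.box), "text": new_text})
        new_fields.append(OcrField(field.label, new_text, field.box))
    annotation = json.dumps([asdict(f) for f in new_fields], indent=2)
    return new_fields, render_ops, annotation


if __name__ == "__main__":
    source = [
        OcrField("vendor", "Real Corp", (40, 30, 200, 24)),
        OcrField("date", "2024-01-05", (400, 30, 120, 24)),
        OcrField("total", "123.45", (400, 700, 90, 24)),
    ]
    _, ops, annotation = synthesize(source, {"vendor", "total"}, seed=1)
    print(annotation)
    print(ops)
```

Because the annotation is built from the same records that drive the render operations, the JSON and the edited image cannot drift apart, which is the alignment property the abstract emphasizes. A real renderer would also need to match font size and style so the new text fits the original box.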

