Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images
This work addresses the slow and error-prone manual extraction of tables in document analysis. The contribution is incremental, since it builds on existing methods such as TableNet.
The paper tackles table detection in document images with a LaTeX-based synthetic data augmentation pipeline that generates realistic two-column pages with tables. It reports a pixel-wise XOR error of 4.04% on synthetic data and 9.18% on the real-world Marmot benchmark.
Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. TableNet trained on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set at 256x256 input resolution and 4.33% at 1024x1024. On the Marmot benchmark, the best result is a pixel-wise XOR error of 9.18% (at 256x256). Because pages and masks are generated automatically, the pipeline also reduces manual annotation effort.
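
The pixel-wise XOR error reported above can be illustrated with a minimal sketch. It assumes the metric is the percentage of pixels at which the binarized predicted table mask and the ground-truth mask disagree; the abstract does not spell out the exact definition, so this is an interpretation. The function name `xor_error` and the toy masks are hypothetical and are not taken from the paper.

```python
import numpy as np


def xor_error(pred_mask, true_mask):
    """Percentage of pixels where two binary masks disagree.

    Assumed definition (not confirmed by the source): 100 * mean(pred XOR true).
    """
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    # logical_xor is True exactly where one mask marks a table pixel and the other does not
    return float(np.logical_xor(pred, true).mean() * 100.0)


# Toy 4x4 example: the ground truth marks a 2x2 table region,
# the prediction extends one column too far to the right.
true = np.zeros((4, 4), dtype=bool)
true[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

error = xor_error(pred, true)
print(f"XOR error: {error:.2f}%")
```

In this toy case, 2 of the 16 pixels differ, so the error is 12.5%. Under this reading, lower is better and a perfect mask gives 0%.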