IR AI CV

Oct 31, 2022

Tables to LaTeX: structure and content extraction from scientific tables

Pratik Kayal, Mrinal Anand, Harsh Desai, Mayank Singh

arXiv:2210.17246v1

11.416 citations

h-index: 13

Originality: Incremental advance

AI Analysis

The paper addresses the problem of automating table extraction for researchers and publishers, but the advance is incremental: it builds on existing transformer-based methods, with adaptations specific to scientific tables.

The paper tackles the problem of extracting structure and content from scientific tables in PDF documents, which is challenging due to visual and content features like spanning cells and mathematical symbols. The result is a transformer-based model that converts tabular images to LaTeX source code, achieving exact match accuracies of 70.35% for structure and 49.69% for content extraction.

Scientific documents contain tables that list important information in a concise fashion. Structure and content extraction from tables embedded within PDF research documents is a very challenging task due to the existence of visual features like spanning cells and content features like mathematical symbols and equations. Most existing table structure identification methods tend to ignore these academic writing features. In this paper, we adapt the transformer-based language modeling paradigm for scientific table structure and content extraction. Specifically, the proposed model converts a tabular image to its corresponding LaTeX source code. Overall, we outperform the current state-of-the-art baselines and achieve exact match accuracies of 70.35% and 49.69% on table structure and content extraction, respectively. Further analysis demonstrates that the proposed models efficiently identify the number of rows and columns, the alphanumeric characters, the LaTeX tokens, and symbols.
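
The exact match figures quoted above can be read as the share of tables whose predicted LaTeX sequence equals the ground truth in full. The following is a minimal sketch of such a metric, not the authors' evaluation code: the function name `exact_match_accuracy`, the whitespace tokenization, and the two toy tables are assumptions made for illustration only.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose token sequence equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    # Assumption: tokens are separated by whitespace; the paper's tokenizer may differ.
    matches = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * matches / len(predictions)


# Toy data (hypothetical): two ground-truth tables and two model outputs.
references = [
    r"\begin{tabular}{cc} a & b \\ \end{tabular}",
    r"\begin{tabular}{cc} \multicolumn{2}{c}{x} \\ \end{tabular}",
]
predictions = [
    r"\begin{tabular}{cc}  a & b \\ \end{tabular}",  # extra space only: still a match
    r"\begin{tabular}{cc} x & \\ \end{tabular}",     # spanning cell missed: no match
]

score = exact_match_accuracy(predictions, references)
print(f"{score:.2f}%")
```

Because a single wrong token anywhere in the sequence makes the whole table count as a miss, exact match is a strict measure, and a missed spanning cell such as the `\multicolumn` in the second toy table is enough to fail it.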
