CV · CL · DL · LG · Jan 16

PubMed-OCR: PMC Open Access OCR Annotations

arXiv:2601.11425v1 · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This dataset supports downstream research on layout-aware modeling and OCR-dependent pipelines for scientific documents. The contribution is incremental, since it builds on existing PDFs and OCR tools.

The authors tackled the lack of a large-scale, annotated OCR dataset for scientific articles by creating PubMed-OCR, a corpus of 209.5K articles (1.5M pages, ~1.3B words) with word-, line-, and paragraph-level bounding boxes from PubMed Central PDFs.
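To give a rough sense of scale, the headline figures can be turned into per-article and per-page averages. This is a back-of-the-envelope sketch only: the inputs are the rounded counts quoted above, so the results are approximate, and the variable names are illustrative.

```python
# Rounded corpus counts as quoted in the summary (approximate by construction).
articles = 209_500          # 209.5K articles
pages = 1_500_000           # 1.5M pages
words = 1_300_000_000       # ~1.3B words

# Simple averages derived from the rounded counts.
pages_per_article = pages / articles
words_per_page = words / pages
words_per_article = words / articles

print(f"pages per article: {pages_per_article:.1f}")
print(f"words per page:    {words_per_page:.0f}")
print(f"words per article: {words_per_article:.0f}")
```

On these rounded inputs, an average article is about 7 pages long with somewhat under 900 words per page.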

PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
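The abstract mentions heuristic line reconstruction from word-level boxes but does not say which heuristic is used. The sketch below shows one common approach, greedy grouping by vertical overlap, purely as an illustration. The `Word` record, its field names, and the `min_overlap` threshold are assumptions made for this example and are not taken from the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """Hypothetical word record: text plus an axis-aligned box in page pixels."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_words_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[List[Word]]:
    """Greedily assign each word to the first line it vertically overlaps.

    A word joins a line when the vertical overlap is at least `min_overlap`
    of the smaller of the two heights (assumed threshold, not from the paper).
    """
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        for line in lines:
            top = min(v.y0 for v in line)
            bottom = max(v.y1 for v in line)
            overlap = min(bottom, w.y1) - max(top, w.y0)
            smaller = min(bottom - top, w.y1 - w.y0)
            if smaller > 0 and overlap / smaller >= min_overlap:
                line.append(w)
                break
        else:
            # No existing line overlaps enough, so start a new one.
            lines.append([w])
    for line in lines:
        line.sort(key=lambda w: w.x0)           # left-to-right reading order
    lines.sort(key=lambda l: min(w.y0 for w in l))  # top-to-bottom
    return lines


def line_bbox(line: List[Word]) -> Tuple[float, float, float, float]:
    """Line-level box as the union of its word boxes."""
    return (min(w.x0 for w in line), min(w.y0 for w in line),
            max(w.x1 for w in line), max(w.y1 for w in line))


words = [
    Word("PubMed", 10, 10, 60, 22),
    Word("OCR", 65, 11, 90, 23),
    Word("corpus", 10, 30, 55, 42),
]
lines = group_words_into_lines(words)
print([[w.text for w in line] for line in lines])
print(line_bbox(lines[0]))
```

A heuristic like this tends to break on multi-column pages, rotated text, and tables, which is consistent with the abstract listing line reconstruction among the corpus limitations.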
