PubMed-OCR: PMC Open Access OCR Annotations
This dataset supports downstream research on layout-aware modeling and OCR-dependent pipelines for scientific documents, though the contribution is incremental because it builds on existing PDFs and OCR tools.
The authors address the lack of a large-scale, annotated OCR dataset for scientific articles by creating PubMed-OCR, a corpus of 209.5K articles (1.5M pages, ~1.3B words) drawn from PubMed Central PDFs and annotated with word-, line-, and paragraph-level bounding boxes.
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
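To make the coordinate-grounded use case concrete, the following is a minimal sketch of how word-level boxes from a page annotation might be queried by region. The record layout and every field name in it (`words`, `text`, `bbox`, `width`, `height`) are assumptions made for illustration, as are the `[x0, y0, x1, y1]` pixel convention and the sample values; the released schema may differ, so consult it before adapting this.

```python
import json

# Hypothetical page record. The field names and the [x0, y0, x1, y1]
# box convention are illustrative assumptions, not the released schema.
SAMPLE_PAGE = json.loads("""
{
  "page": 1,
  "width": 1000,
  "height": 1400,
  "words": [
    {"text": "Tumor",   "bbox": [100, 200, 160, 220]},
    {"text": "growth",  "bbox": [170, 200, 240, 220]},
    {"text": "Methods", "bbox": [100, 600, 190, 625]}
  ]
}
""")


def words_in_region(page, region):
    """Return the text of words whose box center lies inside region.

    region is (x0, y0, x1, y1) in the same units as the word boxes.
    """
    rx0, ry0, rx1, ry1 = region
    hits = []
    for word in page["words"]:
        x0, y0, x1, y1 = word["bbox"]
        # Use the box center so words straddling the border are
        # assigned to exactly one side.
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            hits.append(word["text"])
    return hits


# Words in the top 300 units of the page.
print(words_in_region(SAMPLE_PAGE, (0, 0, 1000, 300)))
```

Testing the box center rather than full containment is one possible design choice: it tolerates slightly loose OCR boxes, at the cost of including words that only partly overlap the query region.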