olmOCR 2: Unit Test Rewards for Document OCR
The work targets accurate OCR for documents with complex layouts, which benefits anyone processing digitized documents, though the contribution is incremental: it updates an earlier olmOCR system.
The paper addresses the conversion of digitized print documents into clean text. It introduces olmOCR 2, a 7B vision language model trained with reinforcement learning using binary unit tests as rewards, which achieves state-of-the-art performance on the olmOCR-Bench benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts.
We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
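To make the idea of binary unit tests as rewards concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the olmOCR 2 code: the `UnitTest` class, the three check helpers, and the choice to score an output by the fraction of tests it passes are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UnitTest:
    """Hypothetical container for one binary check on OCR output."""
    name: str
    check: Callable[[str], bool]  # True means pass, False means fail


def text_present(snippet: str) -> Callable[[str], bool]:
    # Passes when the snippet appears somewhere in the output.
    return lambda output: snippet in output


def text_absent(snippet: str) -> Callable[[str], bool]:
    # Passes when the snippet (e.g. a page footer) does not appear.
    return lambda output: snippet not in output


def appears_before(first: str, second: str) -> Callable[[str], bool]:
    # Passes when both snippets occur and `first` precedes `second`,
    # a simple proxy for natural reading order.
    def check(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return check


def reward(output: str, tests: List[UnitTest]) -> float:
    # Assumed aggregation: the fraction of binary tests that pass.
    if not tests:
        return 0.0
    passed = sum(1 for t in tests if t.check(output))
    return passed / len(tests)


tests = [
    UnitTest("title present", text_present("Introduction")),
    UnitTest("no page footer", text_absent("Page 3 of 10")),
    UnitTest("column order", appears_before("left column", "right column")),
]

candidate = "Introduction\nleft column text\nright column text"
score = reward(candidate, tests)
```

Each check returns only pass or fail, so the reward can be verified without a learned judge. In this sketch, the test cases would be extracted from the known HTML source of a synthetic document rather than written by hand.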