Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction
This work addresses the need for efficient Arabic OCR and document digitization by providing open-source models and a large-scale dataset for researchers. The contribution is incremental, since it builds on Meta's Nougat architecture.
The authors convert Arabic book pages into structured Markdown text by fine-tuning vision transformers. They report state-of-the-art performance, with the arabic-large-nougat model achieving the highest Markdown Structure Accuracy and the lowest Character Error Rate.
We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at https://github.com/MohamedAliRashad/arabic-nougat.
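For readers unfamiliar with the Character Error Rate mentioned above, the following is a minimal sketch of the conventional definition: the Levenshtein (edit) distance between the reference and the predicted text, divided by the reference length. The abstract does not spell out the exact evaluation protocol (for example, text normalization or handling of diacritics), so the function names `levenshtein` and `character_error_rate` and the lack of any normalization are assumptions made for illustration, not the authors' evaluation code.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One missing character out of a four-character reference word.
    print(character_error_rate("كتاب", "كتب"))
```

A lower value is better, and the value can exceed 1.0 when the prediction is much longer than the reference.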