CL · CV · LG · May 12

DocAtlas: Multilingual Document Understanding Across 80+ Languages

arXiv:2605.12623 · 95.4
Predicted impact: top 30% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For multilingual document understanding, DocAtlas provides a scalable data generation framework and demonstrates that DPO enables stable multilingual adaptation without degrading performance on high-resource languages.

DocAtlas constructs high-fidelity OCR datasets and benchmarks for 82 languages, revealing persistent gaps in low-resource scripts. Using Direct Preference Optimization (DPO), it improves in-domain accuracy by 1.9% and out-of-domain accuracy by 1.8% without base-language degradation, with the best variant outperforming the strongest baseline by 1.7%.

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
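
As a rough illustration of the preference objective named in the abstract, the sketch below computes the standard DPO loss for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for this example. In the setting the abstract describes, the "chosen" sequence would presumably be the rendering-derived ground truth, and the "rejected" sequence some less accurate output; how DocAtlas actually builds rejected samples is not stated here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each sequence under the
    trainable policy and under the frozen reference model. `beta` is an
    assumed temperature, not a value reported by the paper.
    """
    # How much more (or less) likely each sequence is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favours "chosen".
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
no_preference = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy equals reference
prefers_chosen = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # policy favours chosen
prefers_rejected = dpo_loss(-5.0, -1.0, -2.0, -2.0) # policy favours rejected
```

When the policy matches the reference, the loss equals log 2; it falls below that as the policy shifts probability toward the chosen sequence relative to the reference, and rises above it in the opposite case. Because the loss depends only on log-ratios against the reference, the reference model anchors the update, which is one common explanation for why DPO tends to drift less from the base model than plain supervised fine-tuning.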

