CLAICVDec 4, 2025

KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

arXiv:2512.11849v11 citationsh-index: 4
Originality Synthesis-oriented
AI Analysis

This addresses a critical gap for document AI tools in Khmer, a language with over 17 million speakers, by enabling better processing of business documents for public and private use, though it is incremental as it extends existing annotation frameworks to a new language.

The authors tackled the lack of resources for automated layout analysis of low-resource Khmer business documents by creating KH-FUNSD, a hierarchically annotated dataset, and provided baseline results for models on this new benchmark.

Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes