CL · May 15

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann

arXiv:2605.15794 · 85.5

Predicted impact: top 50% in CL · last 90 days
Originality: Synthesis-oriented

AI Analysis

For researchers in multimodal machine translation, this dataset provides a benchmark for developing layout-aware translation models that preserve document structure.

The authors introduce ForMaT, a parallel corpus of 3,956 PDFs across 15 language pairs with preserved layout metadata, and show that current MT systems fail to maintain spatial grounding and geometric synchronization, highlighting the need for layout-aware models.

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs, proposed for multimodal machine translation, that preserves the original layout metadata. To ensure structural diversity, we apply K-Medoids sampling over 45 geometric features that capture complex elements such as nested tables and formulas, so that only visually diverse PDF documents are selected. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
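The K-Medoids sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the 45 geometric features, the distance metric, or the K-Medoids variant, so everything below is an assumption: the feature matrix is random placeholder data, the features are z-scored, distances are Euclidean, and the function `k_medoids` is a hypothetical helper implementing the simple alternating (Voronoi-iteration) variant, not the authors' code.

```python
import numpy as np


def k_medoids(features, k, n_iter=100, seed=0):
    """Alternating k-medoids; returns sorted row indices of the chosen medoids.

    Assumptions (not from the paper): z-scored features, Euclidean distance,
    random initialisation, Voronoi-style updates rather than full PAM swaps.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)

    # Standardise each feature so no single geometric measure dominates.
    std = x.std(axis=0)
    std[std == 0] = 1.0
    x = (x - x.mean(axis=0)) / std

    # Pairwise Euclidean distances via the squared-norm expansion (n x n matrix).
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    dist = np.sqrt(np.maximum(d2, 0.0))
    np.fill_diagonal(dist, 0.0)

    n = x.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every document to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance to its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return np.sort(medoids)


# Placeholder data: 200 hypothetical documents, 45 features each.
rng = np.random.default_rng(42)
docs = rng.normal(size=(200, 45))
selected = k_medoids(docs, k=10)
print(selected)  # indices of 10 representative documents
```

Because medoids are actual rows of the input, the returned indices point at real documents, which is what makes K-Medoids (rather than K-Means) a natural fit for selecting a structurally diverse subset of PDFs.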
