CL LG Feb 18, 2021

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka

arXiv:2102.09550v3 13.5 197 citations h-index: 9

Originality: Highly original

AI Analysis

The paper addresses the challenge of understanding complex documents that combine text with layout and visual elements, with applications in document analysis and question answering. It presents a novel method rather than an incremental improvement.

The paper tackles the problem of natural language comprehension beyond plain-text documents by introducing the TILT neural network architecture, which achieves state-of-the-art results in extracting information from documents and answering layout-dependent questions on datasets like DocVQA, CORD, and SROIE.

We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
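
To make "the layout is represented as an attention bias" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the bias is a learned scalar looked up from the bucketed horizontal and vertical distance between two tokens and added to the attention scores before the softmax. The names `bucket`, `layout_biased_attention`, `h_table`, `v_table`, and the uniform bucketing scheme are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bucket(delta, num_buckets, max_dist):
    # Hypothetical uniform bucketing: map signed distances in
    # [-max_dist, max_dist] to integer ids in [0, num_buckets).
    clipped = np.clip(delta, -max_dist, max_dist)
    scaled = (clipped + max_dist) / (2 * max_dist)  # now in [0, 1]
    return np.minimum((scaled * num_buckets).astype(int), num_buckets - 1)

def layout_biased_attention(q, k, v, centers, h_table, v_table, max_dist=1.0):
    # q, k, v: (n, d) arrays for n tokens.
    # centers: (n, 2) array of normalized (x, y) token positions on the page.
    # h_table, v_table: 1-D arrays of per-bucket biases (learned in a real model).
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # content-based scores
    dx = centers[None, :, 0] - centers[:, None, 0]   # dx[i, j] = x_j - x_i
    dy = centers[None, :, 1] - centers[:, None, 1]   # dy[i, j] = y_j - y_i
    bias = (h_table[bucket(dx, len(h_table), max_dist)]
            + v_table[bucket(dy, len(v_table), max_dist)])
    weights = softmax(scores + bias, axis=-1)        # layout shifts the scores
    return weights @ v, weights
```

With all-zero bias tables the function reduces to ordinary scaled dot-product attention, so under this assumption layout only shifts how strongly token pairs attend to each other and leaves the rest of the Transformer unchanged.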
