CV · AI · LG · Jun 15, 2022

Test-Time Adaptation for Visual Document Understanding

arXiv:2206.07240v27 citations · h-index: 45
Originality: Incremental advance
AI Analysis

This paper addresses domain adaptation in visual document understanding for tasks such as entity recognition and question answering. It represents an incremental advance in test-time adaptation methods.

The paper tackles the problem of adapting visual document understanding models to distribution shifts at test time. It proposes DocTTA, which improves over the unadapted source model by up to 1.89% in F1 score for entity recognition, 3.43% in F1 score for key-value extraction, and 17.68% in ANLS score for document visual question answering.
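For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), the following is a minimal sketch of how the metric is commonly computed for document question answering. The function names, the lowercasing and whitespace stripping, and the 0.5 threshold default are assumptions based on the usual definition of the metric, not details taken from this paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best similarity to any accepted answer.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted answer strings (one list per question).
    threshold:   assumed cutoff; a normalized distance at or above it scores 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ans in answers:
            a = ans.strip().lower()
            denom = max(len(p), len(a))
            nl = levenshtein(p, a) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition an exact match scores 1, a near miss such as a single wrong character receives partial credit, and an answer that differs in half or more of its characters scores 0.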

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. DocTTA shows significant improvements on these compared to the source model performance, up to 1.89% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively. Our benchmark datasets are available at https://saynaebrahimi.github.io/DocTTA.html.
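The abstract names two ingredients, masked visual language modeling and pseudo labeling, without giving details. The sketch below illustrates only the general idea of confidence-based pseudo labeling combined with a self-supervised loss. It is not the authors' implementation: the function names, the max-probability selection rule, the 0.9 threshold, and the weighted-sum objective are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def select_pseudo_labels(batch_logits, min_confidence=0.9):
    """Keep (index, label) pairs whose top class probability meets the threshold.

    The threshold rule is a hypothetical selection criterion, not the paper's.
    """
    selected = []
    for idx, logits in enumerate(batch_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= min_confidence:
            selected.append((idx, label))
    return selected


def pseudo_label_loss(batch_logits, selected):
    """Mean cross-entropy of the model against its own confident predictions."""
    if not selected:
        return 0.0
    total = 0.0
    for idx, label in selected:
        total += -math.log(softmax(batch_logits[idx])[label])
    return total / len(selected)


def adaptation_objective(mvlm_loss, pl_loss, pl_weight=1.0):
    """Assumed combination: self-supervised loss plus weighted pseudo-label loss."""
    return mvlm_loss + pl_weight * pl_loss


# Two unlabeled target examples: one confident prediction, one ambiguous.
logits = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
chosen = select_pseudo_labels(logits)
loss = adaptation_objective(mvlm_loss=0.5, pl_loss=pseudo_label_loss(logits, chosen))
```

Only the first example is kept here, because the second has nearly uniform class probabilities. Filtering in this way is one common guard against reinforcing the model's own mistakes when no target labels are available.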

