Test-Time Adaptation for Visual Document Understanding
This work addresses the challenge of domain adaptation in visual document understanding for tasks such as entity recognition and question answering, and represents an incremental advance in test-time adaptation methods.
The paper tackles the problem of adapting visual document understanding models to distribution shifts at test time. It proposes DocTTA, which achieves improvements of up to 1.89% in F1 score for entity recognition, 3.43% in F1 score for key-value extraction, and 17.68% in ANLS score for document visual question answering.
For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains unexplored. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. On these benchmarks, DocTTA improves significantly over the source model, by up to 1.89\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively. Our benchmark datasets are available at \url{https://saynaebrahimi.github.io/DocTTA.html}.
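For readers unfamiliar with the ANLS score reported for document visual question answering, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined, not the evaluation code used for the results above. The function names `levenshtein` and `anls`, the lowercase-and-strip normalization, and the threshold of 0.5 are assumptions made for illustration.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several accepted answers; the best match counts.
    A similarity is zeroed when the normalized distance reaches the threshold
    (0.5 here, an assumed default).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

Because the score is averaged per question and lies between 0 and 1, near-miss answers (for example, small OCR errors) earn partial credit, while answers that differ by half or more of their characters earn none.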