End-to-end Document Recognition and Understanding with Dessurt
This work addresses the need for flexible, integrated document recognition and understanding systems. It appears incremental, since it builds on existing transformer architectures and does not claim a major breakthrough.
The paper introduces Dessurt, an end-to-end transformer for document understanding that takes a document image and a task string as input and generates text as output, which removes the need for an external recognition model. The abstract reports that the model is effective on 9 dataset-task combinations but gives no specific performance numbers.
We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and a task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model as prior methods do. Dessurt is more flexible than prior methods and can handle a variety of document domains and tasks. We show that this model is effective on 9 different dataset-task combinations.
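The input/output interface described above (image plus task string in, text generated one token at a time) can be illustrated with a minimal greedy decoding loop. This is only a sketch of the general autoregressive pattern, not Dessurt's implementation: the names `generate`, `toy_model`, and `END`, the task strings `"read"` and `"count_rows"`, and the nested-list image are all hypothetical stand-ins, and the toy model simply returns canned answers where a trained network would predict the next token.

```python
from typing import Callable, List, Sequence

END = "<end>"  # hypothetical end-of-sequence marker (an assumption, not from the paper)

Image = Sequence[Sequence[int]]  # toy image: rows of pixel values
NextToken = Callable[[Image, str, List[str]], str]


def generate(model: NextToken, image: Image, task: str, max_tokens: int = 32) -> List[str]:
    """Greedy autoregressive decoding: each step sees the image, the task
    string, and every token produced so far, then emits one more token."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        next_token = model(image, task, tokens)
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens


def toy_model(image: Image, task: str, tokens: List[str]) -> str:
    """Stand-in for a trained network: looks up a canned answer per task."""
    canned = {
        "read": ["hello", "world"],        # pretend text recognition
        "count_rows": [str(len(image))],   # pretend question answering
    }
    answer = canned.get(task, [])
    # Emit the next token of the canned answer, or END once it is exhausted.
    return answer[len(tokens)] if len(tokens) < len(answer) else END


if __name__ == "__main__":
    image = [[0, 1], [1, 0]]
    print(generate(toy_model, image, "read"))
    print(generate(toy_model, image, "count_rows"))
```

The point of the sketch is that a single loop serves every task: changing the task string changes the output, with no separate recognition step before decoding.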