A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages
This work addresses the need for integrated information extraction from document images, which could make applications such as digitization and document analysis more efficient. The contribution is incremental, since it builds on existing deep learning architectures.
The authors address the reliance on separate methods for text localization, transcription, and named entity recognition in document images by proposing an end-to-end neural model that performs all three tasks jointly in a single feed-forward step. The reported results indicate that the model benefits from features shared across the tasks.
In recent years, the consolidation of deep neural network architectures for information extraction from document images has greatly improved performance on each of the tasks involved in this process: text localization, transcription, and named entity recognition. However, the process is traditionally performed with a separate method for each task. In this work we propose an end-to-end model that combines a one-stage object detection network with branches for text recognition and named entity recognition, so that shared features are learned simultaneously from the training error of every task. As a result, the model jointly performs handwritten text detection, transcription, and named entity recognition at page level in a single feed-forward step. We exhaustively evaluate our approach on different datasets, discussing its advantages and limitations compared to sequential approaches. The results show that the model is capable of benefiting from shared features for simultaneously solving interdependent tasks.
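
The idea that shared features are learned from the training error of every task can be illustrated with a toy sketch. This is not the model described above: it assumes a single shared scalar weight, three linear heads with squared-error losses, and per-task loss weights. All names (`joint_loss_and_shared_grad`, `head_gains`, `task_weights`) are hypothetical. It only shows that the gradient reaching the shared parameter is the weighted sum of the gradients from the three task losses.

```python
def joint_loss_and_shared_grad(x, w, head_gains, targets, task_weights):
    """Toy joint training signal for one shared weight `w`.

    Assumed setup (illustrative only, not the paper's architecture):
      shared feature   h = w * x
      head k predicts  y_k = a_k * h
      task loss k      L_k = (y_k - t_k) ** 2
      joint loss       L = sum_k lam_k * L_k
    Returns (L, dL/dw).
    """
    h = w * x  # shared feature used by every head
    total_loss = 0.0
    shared_grad = 0.0
    for a, t, lam in zip(head_gains, targets, task_weights):
        err = a * h - t
        total_loss += lam * err ** 2
        # Chain rule: dL_k/dw = 2 * err * a * x; each task adds its share.
        shared_grad += lam * 2.0 * err * a * x
    return total_loss, shared_grad


# Three heads standing in for localization, transcription, and entity tagging.
loss, grad = joint_loss_and_shared_grad(
    x=1.0,
    w=1.0,
    head_gains=(1.0, 2.0, 3.0),
    targets=(0.0, 0.0, 0.0),
    task_weights=(1.0, 1.0, 1.0),
)
print(loss, grad)
```

Setting one entry of `task_weights` to zero removes that task's contribution to the shared gradient, which is the toy analogue of training the shared features without that branch.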