Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset
This work addresses the challenge of processing hybrid documents for applications like financial analysis, though it is incremental in improving existing LLM capabilities.
The authors address information extraction from hybrid long documents (HLDs), which contain both text and tables and exceed LLM token limits. They develop an Automated Information Extraction (AIE) framework and introduce the FINE dataset, showing that the framework adapts to complex scenarios and identifying effective summarization and table serialization methods.
Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction (AIE) framework to enable LLMs to process HLDs, and we carry out experiments to analyze four important aspects of information extraction from HLDs. We report four findings: 1) an effective way to select and summarize the useful parts of an HLD; 2) that a simple table serialization method is enough for LLMs to understand tables; 3) that the naive AIE adapts to many complex scenarios; and 4) prompt engineering techniques that enhance LLMs on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
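To illustrate two of the ideas mentioned above (simple table serialization, and breaking a document that exceeds the token limit into smaller pieces), here is a minimal Python sketch. It is not the AIE implementation: the function names `serialize_table` and `split_into_segments`, the ` | ` cell separator, and the use of whitespace word counts as a stand-in for token counts are all assumptions made for illustration.

```python
from typing import List


def serialize_table(rows: List[List[str]]) -> str:
    """Flatten a table into plain text: one line per row, cells joined by ' | '.

    Hypothetical helper; the separator is an illustrative choice.
    """
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


def split_into_segments(elements: List[str], max_words: int) -> List[str]:
    """Greedily pack document elements into segments of at most max_words words.

    Word count is only a rough proxy for an LLM token count (assumption).
    An element longer than max_words becomes its own segment and is not split.
    """
    segments: List[str] = []
    current: List[str] = []
    count = 0
    for element in elements:
        n = len(element.split())
        # Close the current segment if adding this element would exceed the budget.
        if current and count + n > max_words:
            segments.append("\n".join(current))
            current, count = [], 0
        current.append(element)
        count += n
    if current:
        segments.append("\n".join(current))
    return segments


# Toy hybrid document: a paragraph, a small table, another paragraph.
table = [["Item", "2022", "2021"], ["Revenue", "120", "100"]]
elements = [
    "Revenue grew during the year.",
    serialize_table(table),
    "Costs were stable.",
]
segments = split_into_segments(elements, max_words=12)
for i, segment in enumerate(segments):
    print(f"--- segment {i} ---")
    print(segment)
```

In a full pipeline, each segment would then be scored for relevance, summarized, and passed to the LLM for extraction; those steps depend on the model and prompts used and are omitted here.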