A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
This work addresses a specific problem in telecom-domain QA: it enables retrieval of flowchart images without requiring a visual model at inference time. The contribution is incremental, since it builds on existing RAG and VLM methods.
The paper tackles question answering over technical documents in which the answers appear in flowcharts. It integrates graph representations produced by visual language models into a text-based retrieval system. On telecom documents, the authors report good retrieval performance and, for graph representations from a fine-tuned VLM, a lower edit distance to the ground truth.
Question Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval-Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them into a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach: processing technical documents, classifying image types, building graph representations, and incorporating them into the text embedding pipeline for efficient retrieval. We benchmark this approach on a QA dataset created from proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM have a lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, QA using these representations gives good retrieval performance with text-based embedding models, including a telecom-domain-adapted one. Our approach also removes the need for a VLM at inference time, which is an important cost benefit for deployed QA systems.
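The idea of indexing a flowchart as text can be illustrated with a minimal sketch. It is not the authors' implementation. The graph schema, the `graph_to_text` and `retrieve` helpers, the chunk identifiers, and the sample flowchart are all hypothetical, and a bag-of-words cosine similarity stands in for the text embedding model used in the paper. The graph is assumed to have been extracted offline by a VLM, so no visual model is called at query time.

```python
import math
import re
from collections import Counter

# Hypothetical flowchart graph, as a VLM might emit it offline (illustrative data only).
flowchart = {
    "nodes": {
        "A": "Receive alarm",
        "B": "Is link down",
        "C": "Restart interface",
        "D": "Close ticket",
    },
    # Each edge is (source node, target node, optional branch label).
    "edges": [("A", "B", ""), ("B", "C", "yes"), ("B", "D", "no")],
}


def graph_to_text(graph):
    """Serialize edges into plain sentences that a text retriever can index."""
    nodes = graph["nodes"]
    sentences = []
    for src, dst, label in graph["edges"]:
        condition = f" if {label}" if label else ""
        sentences.append(f'From "{nodes[src]}" go to "{nodes[dst]}"{condition}.')
    return " ".join(sentences)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks):
    """Return the id of the chunk most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(text))), chunk_id)
        for chunk_id, text in chunks.items()
    ]
    return max(scored)[1]


# The serialized flowchart sits in the same index as ordinary text chunks.
chunks = {
    "text:intro": "This document describes the installation of the radio unit.",
    "figure:alarm_flow": graph_to_text(flowchart),
}

best = retrieve("What happens if the link is down?", chunks)
print(best)
```

Because the figure is stored as text under its own identifier, a text-only retriever can return the figure for a question whose answer lies in the flowchart. A real system would replace the token-count vectors with embeddings from a text embedding model.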