JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models
This work addresses a bottleneck in developing precise flowchart understanding in vision and language models, particularly for Japanese business applications, but the contribution is incremental: it focuses on dataset creation rather than a new model or method.
The authors tackle the lack of large-scale datasets for training vision and language models to understand Japanese flowcharts by introducing JSynFlow, a synthesised visual question-answering dataset generated using large language models. They demonstrate that fine-tuning with JSynFlow significantly improves model performance on flowchart-based QA tasks.
Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.
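To make the three components of a dataset entry concrete (task description, DSL code for the flowchart, and QA pairs), the following is a minimal sketch under stated assumptions. The abstract does not name the DSL or the dataset's field names, so Mermaid syntax, the `FlowchartRecord` class, and the helpers `to_mermaid` and `next_step_qa` are all hypothetical. The paper generates its content with LLMs; the rule-based QA derivation here is a stand-in used only for illustration, and the labels are in English for readability.

```python
from dataclasses import dataclass, field


@dataclass
class FlowchartRecord:
    """Hypothetical record layout; the real dataset's field names may differ."""
    task_description: str
    dsl_code: str
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)


def to_mermaid(nodes: dict[str, str], edges: list[tuple[str, str]]) -> str:
    """Serialise nodes and edges as Mermaid flowchart text (an assumed DSL)."""
    lines = ["flowchart TD"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst in edges:
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


def next_step_qa(nodes: dict[str, str], edges: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Derive simple 'which step follows' QA pairs from the edges (rule-based stand-in)."""
    return [(f"Which step follows '{nodes[src]}'?", nodes[dst]) for src, dst in edges]


# Illustrative business task, invented for this sketch.
nodes = {"A": "Receive invoice", "B": "Check amount", "C": "Approve payment"}
edges = [("A", "B"), ("B", "C")]

record = FlowchartRecord(
    task_description="An accounts clerk receives an invoice, checks the amount, then approves payment.",
    dsl_code=to_mermaid(nodes, edges),
    qa_pairs=next_step_qa(nodes, edges),
)
print(record.dsl_code)
```

Because the flowchart is kept as DSL text, the image can be re-rendered with any compatible renderer, and the answers to the QA pairs can be checked against the same graph that produced the image.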