CL | Jan 8, 2025

Graph-Based Multimodal Contrastive Learning for Chart Question Answering

arXiv:2501.04303v2 | 4 citations | h-index: 21 | SIGIR
Originality: Incremental advance
AI Analysis

This paper addresses the problem of interpreting complex charts for question answering, which matters for data analysis applications. It appears to be an incremental advance that builds on existing multimodal and graph-based methods.

The paper tackles chart question answering by introducing a joint multimodal scene graph framework that models relationships among chart components and uses graph contrastive learning to align visual and textual representations, achieving significant performance improvements on benchmarks like ChartQA, OpenCQA, and ChartX.

Chart question answering (ChartQA) is challenged by the heterogeneous composition of chart elements and the subtle data patterns they encode. This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics, while a graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts. Moreover, a set of tailored Chain of Thought (CoT) prompts is proposed to enhance multimodal large language models (MLLMs) in zero-shot scenarios by mitigating hallucinations. Extensive evaluations on benchmarks including ChartQA, OpenCQA, and ChartX demonstrate significant performance improvements and validate the efficacy of the proposed approach.
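
The cross-modal alignment step described above can be illustrated with a minimal sketch. The abstract does not specify the loss, so this example assumes a symmetric InfoNCE-style objective over paired visual and textual node embeddings. The function name `node_alignment_loss`, the `temperature` value, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def node_alignment_loss(visual_nodes, text_nodes, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's stated loss).

    Row i of visual_nodes and row i of text_nodes are treated as a positive
    pair; every other row in the batch acts as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    v = visual_nodes / np.linalg.norm(visual_nodes, axis=1, keepdims=True)
    t = text_nodes / np.linalg.norm(text_nodes, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    # Visual-to-text direction: softmax over each row.
    loss_v2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-visual direction: softmax over each column.
    loss_t2v = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)

# Toy data: four nodes with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))  # visual nodes close to their text pairs
shuffled = aligned[[1, 2, 3, 0]]                 # same nodes, deliberately mispaired

loss_aligned = node_alignment_loss(aligned, text)
loss_shuffled = node_alignment_loss(shuffled, text)
print(loss_aligned < loss_shuffled)
```

Correctly paired nodes give a lower loss than mispaired ones, which is the property a contrastive objective relies on to pull matching visual and textual nodes together in a shared space.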

