CL, CV · May 29, 2023

Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs

arXiv:2305.18641v1 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building cross-modal intelligence for chart understanding, which matters for data analysis and communication. The contribution appears incremental: it builds on existing pre-training methods and adds two new pre-training objectives.

The paper tackles automatic chart understanding by introducing ChartT5, a vision and language model that learns to interpret table information from chart images through cross-modal pre-training on plot-table pairs. On the ChartQA benchmark, ChartT5 outperforms state-of-the-art non-pretraining methods by over 8%.

Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community. The capability to uncover the underlying table data of chart figures is key to automatic chart understanding. We introduce ChartT5, a V+L model that learns how to interpret table information from chart images via cross-modal pre-training on plot-table pairs. Specifically, we propose two novel pre-training objectives, Masked Header Prediction (MHP) and Masked Value Prediction (MVP), to equip the model with different skills for interpreting the table information. We have conducted extensive experiments on chart question answering and chart summarization to verify the effectiveness of the proposed pre-training strategies. In particular, on the ChartQA benchmark, our ChartT5 outperforms the state-of-the-art non-pretraining methods by over 8%.
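The abstract names the two objectives but not their mechanics. As a rough, non-authoritative illustration, the sketch below assumes a T5-style text side in which the table is linearized into cells and masked cells are replaced by `<extra_id_N>` sentinels that a decoder would have to fill in. Under that assumption, MHP masks header cells (column headers and row labels) and MVP masks data values. The linearization format, the `|` row separator, the default masking ratio, and the names `Table`, `linearize`, `mask_tokens`, and `make_example` are all hypothetical. The chart image that ChartT5 would condition on is omitted entirely, so this is not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Table:
    col_headers: List[str]   # column header cells, e.g. ["Year", "Sales", "Profit"]
    row_headers: List[str]   # first cell of each data row, e.g. ["2019", "2020"]
    values: List[List[str]]  # remaining data cells, one list per row


def linearize(table: Table) -> List[Tuple[str, str]]:
    """Flatten the table into (cell, kind) pairs; kind is 'header', 'value' or 'sep'."""
    items = [(h, "header") for h in table.col_headers]
    for label, row in zip(table.row_headers, table.values):
        items.append(("|", "sep"))  # hypothetical row separator
        items.append((label, "header"))
        items.extend((v, "value") for v in row)
    return items


def mask_tokens(tokens: Sequence[str], positions: Sequence[int]) -> Tuple[str, str]:
    """Replace each chosen position with its own sentinel; return (input, target) strings."""
    chosen = set(positions)
    src, tgt = [], []
    sentinel = 0
    for i, tok in enumerate(tokens):
        if i in chosen:
            s = f"<extra_id_{sentinel}>"
            src.append(s)
            tgt.extend([s, tok])  # target lists each sentinel followed by the hidden cell
            sentinel += 1
        else:
            src.append(tok)
    return " ".join(src), " ".join(tgt)


def make_example(table: Table, objective: str, ratio: float = 0.3,
                 seed: int = 0) -> Tuple[str, str]:
    """Build one masked (input, target) pair for 'MHP' (headers) or 'MVP' (values)."""
    items = linearize(table)
    kind = {"MHP": "header", "MVP": "value"}.get(objective)
    if kind is None:
        raise ValueError(f"unknown objective: {objective}")
    candidates = [i for i, (_, k) in enumerate(items) if k == kind]
    if not candidates:
        raise ValueError("table has no cells to mask for this objective")
    n_mask = max(1, int(len(candidates) * ratio))  # mask at least one cell
    positions = random.Random(seed).sample(candidates, n_mask)
    return mask_tokens([cell for cell, _ in items], positions)
```

For example, with `ratio=1.0` the MVP variant hides every data value of a two-row table, producing an input such as `Year Sales Profit | 2019 <extra_id_0> <extra_id_1> | 2020 <extra_id_2> <extra_id_3>`. Each masked cell gets its own sentinel here. This differs from T5's span corruption, which merges adjacent masked tokens into a single sentinel span. The separation keeps the mapping from sentinel to header or value explicit.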
