CV | AI | Mar 26, 2024

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

arXiv:2403.17342v1 | 3 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work improves captioning for scientific figures, which aids researchers in understanding papers, but it is incremental as it builds on existing methods like PaddleOCR, LLaMA, and BRIO.

The authors tackled the problem of generating high-quality captions for scientific figures by summarizing textual content from papers, addressing OCR discrepancies and irrelevant text noise, and aligning generation with evaluation metrics. Their solution achieved first place in the ICCV 2023 challenge with a score of 4.49.

In this paper, we propose a solution for improving the quality of captions generated for figures in papers. We adopt the approach of summarizing the textual content in the paper to generate image captions. Throughout our study, we encounter discrepancies in the OCR information provided in the official dataset. To rectify this, we employ the PaddleOCR toolkit to extract OCR information from all images. Moreover, we observe that certain textual content in the official paper pertains to images that are not relevant for captioning, thereby introducing noise during caption generation. To mitigate this issue, we leverage LLaMA to extract image-specific information by querying the textual content based on image mentions, effectively filtering out extraneous information. Additionally, we recognize a discrepancy between the primary use of maximum likelihood estimation during text generation and the evaluation metrics such as ROUGE employed to assess the quality of generated captions. To bridge this gap, we integrate the BRIO model framework, enabling a more coherent alignment between the generation and evaluation processes. Our approach ranked first in the final test with a score of 4.49.
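The gap between maximum likelihood training and metric-based evaluation mentioned above can be illustrated with a small sketch of a BRIO-style contrastive ranking term. This is not the authors' code: the function names, the `margin`, `alpha`, and `gamma` values, and the toy scores are all assumptions chosen for illustration. The idea is that candidate captions are ordered best-first by a metric such as ROUGE, and the model is penalized whenever it assigns a lower-ranked candidate a score that is not sufficiently below a higher-ranked one.

```python
def length_normalized_score(token_logprobs, alpha=1.0):
    """Model score for one candidate: sum of token log-probs divided by
    length**alpha (alpha is an assumed length-penalty hyperparameter)."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)


def ranking_loss(scores, margin=0.001):
    """Pairwise margin loss over candidates ordered best-first by the metric.

    scores[i] is the model score of the i-th best candidate. For every pair
    i < j, the better candidate should outscore the worse one by at least
    margin * (j - i); violations are summed.
    """
    total = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            total += max(0.0, scores[j] - scores[i] + margin * (j - i))
    return total


def combined_loss(mle_loss, scores, gamma=1.0, margin=0.001):
    """Weighted sum of the usual MLE loss and the ranking term
    (gamma is an assumed weighting hyperparameter)."""
    return mle_loss + gamma * ranking_loss(scores, margin)


# Toy values only: model scores for three candidates, listed best-first by metric.
agrees_with_metric = ranking_loss([-0.5, -1.0, -1.5], margin=0.1)
disagrees_with_metric = ranking_loss([-1.5, -1.0, -0.5], margin=0.1)
```

In this toy case the first ordering incurs no penalty because the model already prefers the candidates the metric prefers, while the reversed ordering is penalized on every pair. In practice the scores would come from the captioning model's token log-probabilities, for example via `length_normalized_score`.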

