CV · AI · May 30, 2025

Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

arXiv:2505.24371v3 · 1 citation · h-index: 8 · Has Code · ICIP
Originality: Incremental advance
AI Analysis

This work targets video question answering in privacy-sensitive applications and reports incremental gains in accuracy and localization.

The paper tackles video question answering by proposing Grid-LOGAT, a system that extracts text transcripts from video frames with a vision-language model and answers questions from those transcripts with a large language model. It reports accuracies of 65.9% on NExT-QA and 50.11% on STAR-QA, which the authors state outperform state-of-the-art methods built on similar baseline models.

In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases. First, a Vision-Language Model (VLM) extracts text transcripts from video frames. Next, a Large Language Model (LLM) processes questions using these transcripts to generate answers. This design preserves image privacy by deploying the VLM on edge devices and the LLM in the cloud, so only text transcripts, not images, are sent to the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using an open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on the NExT-QA and STAR-QA datasets, with accuracies of 65.9% and 50.11%, respectively. Additionally, our method surpasses the non-grid version by 24 points on localization-based questions we created using NExT-QA. (This paper is accepted by IEEE ICIP 2025.)
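The grid-based transcription step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The 2x2 grid in the usage note, the transcript line format, and the names `Cell`, `grid_cells`, and `build_transcript` are all assumptions made for illustration, and the VLM call that would produce the per-cell and whole-frame text on the edge device is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One grid cell of a frame (hypothetical helper type)."""
    row: int
    col: int
    box: tuple  # (left, top, right, bottom) in pixels


def grid_cells(width, height, rows, cols):
    """Split a width x height frame into rows x cols cells, row-major order.

    Integer division keeps cell edges aligned, so the cells tile the
    frame exactly even when the size is not divisible by the grid.
    """
    cells = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            cells.append(Cell(r, c, (left, top, right, bottom)))
    return cells


def build_transcript(frame_index, global_text, local_texts):
    """Merge one global description with per-cell descriptions.

    local_texts maps (row, col) to the text a VLM produced for that cell.
    The line format is an assumption, not the paper's prompt format.
    """
    lines = [f"Frame {frame_index} (global): {global_text}"]
    for (row, col), text in sorted(local_texts.items()):
        lines.append(f"Frame {frame_index} cell ({row},{col}): {text}")
    return "\n".join(lines)
```

For a 640x480 frame, `grid_cells(640, 480, 2, 2)` yields four 320x240 boxes. In a setup like the one the abstract describes, each box would be cropped or referenced in a VLM prompt, and only the merged text from `build_transcript` would be sent on to the LLM.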

