IR · CL · CV · LG · Oct 23, 2025

Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures

arXiv:2510.20193v1 · 1 citation · h-index: 9 · Proceedings of the 2nd ACM Workshop in AI-powered Question & Answering Systems
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of building robust QA systems for users who work with multimedia data. As a review paper, its contribution is incremental rather than novel.

This survey reviews recent advancements in question answering systems that integrate multimedia retrieval pipelines, analyzing architectures, benchmarks, and challenges like cross-modal alignment and latency-accuracy tradeoffs.

Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.

